Constructing and Retriving Information From KnowledgeGraph using Langchain

# 1.Introduction

* Graph Retrieval Augemented Generation[Graph RAG](https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/?ref=blog.langchain.dev) is gaining momentum and emerging as powerful additional traditional vector search retrieval methods.
* This approach leverages the structured nature of graph databases, which organize data as nodoe and relationships, to enhance the depth and cotnexuality retrieved information.

![](https://miro.medium.com/v2/resize:fit:700/0*rm_fRSPovV1wfTqH.jpg)

* Graphs are great at representing and storing heterogeneous and interconnected information in a structured manner, effortlessly capturing complex relationships and attributes across diverse data types.
* In contrast, vector databases often struggle with such structured information, as their strength lies in handling unstructured data through high-dimensional vectors.
* In your RAG application, you can combine structured graph data with vector search through unstructured text to achieve the best of both worlds

# 2.Knowledge graphs are great, but how do you create one?

* Constructing a knowledge graph is typically the most challenging step in leveraging the power of graph-based data representation.
* It involves gathering and structuring the data, which requires a deep understanding of both the domain and graph modeling. 
* To simplify this process, we have been experimenting with LLMs. LLMs, with their profound understanding of language and context, can automate significant parts of the knowledge graph creation process. 
* By analyzing text data, these models can identify entities, understand the relationships between them, and suggest how they might be best represented in a graph structure. 

# 3.Installation of Packages

In [None]:
# %pip install --upgrade --quiet  langchain langchain-community langchain-openai langchain-experimental neo4j wikipedia tiktoken yfiles_jupyter_graphs

# 4.Import Required Packages

In [14]:
from langchain_core.runnables import (
    RunnableBranch,
    RunnableLambda,
    RunnableParallel,
    RunnablePassthrough,
)
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.prompts.prompt import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from typing import Tuple, List, Optional
from langchain_core.messages import AIMessage, HumanMessage
from langchain_core.output_parsers import StrOutputParser
import os
from langchain_community.graphs import Neo4jGraph
from langchain.document_loaders import WikipediaLoader
from langchain.text_splitter import TokenTextSplitter
from langchain_openai import ChatOpenAI
from langchain_experimental.graph_transformers import LLMGraphTransformer
from neo4j import GraphDatabase
from yfiles_jupyter_graphs import GraphWidget
from langchain_community.vectorstores import Neo4jVector
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores.neo4j_vector import remove_lucene_chars
from langchain_core.runnables import ConfigurableField, RunnableParallel, RunnablePassthrough

# 5.Configue the Neo4j Environment Setup:

* You need to set up a Neo4j instance follow along with the examples in this blog post. The easiest way is to start a free instance on [Neo4j Aura](https://neo4j.com/cloud/platform/aura-graph-database/), which offers cloud instances of Neo4j database. Alternatively, you can also set up a local instance of the Neo4j database by downloading the Neo4j Desktop application and creating a local database instance.

* **Note : You can also connect your local database as well I am using Local Neo4j DB to connect.**


In [2]:
os.environ["OPENAI_API_KEY"] = "sk-"
os.environ["NEO4J_URI"] = "bolt://localhost:7687"
os.environ["NEO4J_USERNAME"] = "neo4j"
os.environ["NEO4J_PASSWORD"] = "admin@123"

graph = Neo4jGraph()

# 6.DataInjection Pipeline

- we will use LLMs I's Wikipedia page. We can utilize LangChain loaders to fetch and split the documents from Wikipedia seamlessly.

In [3]:
# Read the wikipedia
raw_documents = WikipediaLoader(query="Large Language Model").load()

# Define chunking Strategy
text_splitter = TokenTextSplitter(chunk_size=512, chunk_overlap = 24)
documents = text_splitter.split_documents(raw_documents[:3])

In [5]:
documents

[Document(page_content="India, officially the Republic of India (ISO: Bhārat Gaṇarājya), is a country in South Asia.  It is the seventh-largest country by area; the most populous country as of June 2023; and from the time of its independence in 1947, the world's most populous democracy. Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west; China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand, Myanmar, and Indonesia.\nModern humans arrived on the Indian subcontinent from Africa no later than 55,000 years ago.\nTheir long occupation, initially in varying forms of isolation as hunter-gatherers, has made the region highly diverse, second only to Africa in human genetic diversity. Settled life emerged on the su

Now it's time to construct a graph based on the retrieved documents. For this purpose, we have implemented an `LLMGraphTransformermodule` that significantly simplifies constructing and storing a knowledge graph in a graph database.

In [7]:
llm = ChatOpenAI(temperature=0,model_name = "gpt-3.5-turbo-0125")
llm_transformer = LLMGraphTransformer(llm=llm)

graph_documents = llm_transformer.convert_to_graph_documents(documents)
graph.add_graph_documents(graph_documents, baseEntityLabel=True, include_source=True)

- **LLM Selection and Support:**
  - You can define which LLM you want the knowledge graph generation chain to use.
  - Currently, we support only function calling models from OpenAI and Mistral.
  - Expansion Plans: We plan to expand the LLM selection in the future.
  - Example Model: In this example, we are using the latest GPT-4.
  - Quality Dependency: Note that the quality of generated graph significantly depends on the model you are using.
  - Optimal Model Usage: In theory, you always want to use the most capable one.
- **Graph Generation Process:**
  - The LLM graph transformers return graph documents.
  - Integration with Neo4j: These documents can be imported to Neo4j via the add_graph_documents method.
  - BaseEntityLabel Parameter: It assigns an additional __Entity__ label to each node, enhancing indexing and query performance.
  - Include Source Parameter: It links nodes to their originating documents, facilitating data traceability and context understanding.
- **Graph Inspection:**
  - Visualization Tool: You can inspect the generated graph with yFiles visualization.

In [8]:
# ditectly show the graph resulting from  the given cypher query
default_cypher = "MATCH (s)-[r:!MENTIONS]->(t) RETURN s,r,t LIMIT 50"

def showGraph(cypher:str = default_cypher):
    # create a neo4j session to run queries
    driver = GraphDatabase.driver(
        uri = os.environ["NEO4J_URI"],
        auth = (os.environ["NEO4J_USERNAME"],os.environ["NEO4J_PASSWORD"]))
    session = driver.session()
    widget = GraphWidget(graph = session.run(cypher).graph())
    widget.node_label_mapping = "id"
    # display(widget)
    return widget

showGraph()                               
    

GraphWidget(layout=Layout(height='800px', width='100%'))

# 7.Hybrid Retrieval for RAG:
  - After graph generation, a hybrid retrieval approach is utilized for RAG (Retrieval-Augmented Generation) 
  applications.  
![](https://miro.medium.com/v2/resize:fit:700/1*TJJBOZN9auUioEnqQo-Qdw.png)
  - **Retrieval Process:**
    - User Interaction: The process begins with a user posing a question.
    - RAG Retriever: The question is directed to an RAG retriever.
    - Retrieval Techniques: This retriever employs keyword and vector searches to sift through unstructured text data.
    - Graph Integration: It combines retrieved information with data from the knowledge graph.
    - Single Database System: Since Neo4j supports both keyword and vector indexes, all three retrieval options can be implemented using a single database system.
    - Final Answer Generation: The collected data from these sources is fed into an LLM to generate and deliver the final answer.
- **Unstructured Data Retriever:**
  - **Neo4jVector.from_existing_graph Method:**
    - Functionality: You can utilize this method to add both keyword and vector retrieval to documents.
    - Configuration: It configures keyword and vector search indexes for a hybrid search approach.
    - Target Nodes: This method targets nodes labeled Document.
    - Additional Functionality: It calculates text embedding values if they are missing.

In [9]:
vector_index = Neo4jVector.from_existing_graph(
    OpenAIEmbeddings(),
    search_type="hybrid",
    node_label="Document",
    text_node_properties=["text"],
    embedding_node_property="embedding"
)

- **Vector Index Utilization:**
  - **Similarity_Search Method:**
    - Usage: The vector index can be called with the similarity_search method.

# 8.Graph Retriever:
  - **Configuration and Process:**
    - Involvement and Flexibility: Configuring a graph retrieval is more involved but offers more freedom.
    - Methodology: It utilizes a full-text index to identify relevant nodes and then returns their direct neighborhood.
  - **Example and Visualization:**
    - Example Diagram: 
    
  ![graph](https://raw.githubusercontent.com/tomasonjo/blogs/master/neighbor.png)
  - **Retriever Workflow:**
    - Identification of Relevant Entities: The graph retriever starts by identifying relevant entities in the input.
    - Structured Output Instruction: For simplicity, the LLM is instructed to identify people, organizations, and locations.
  - **Utilization of LCEL:**
    - Method: To achieve this, LCEL is used with the newly added `with_structured_output` method.

In [15]:
# Retriever
graph.query(
    "CREATE FULLTEXT INDEX entity IF NOT EXISTS FOR (e:__Entity__) ON EACH [e.id]")

# Extract entities from text
class Entities(BaseModel):
    """Identifying information about entities."""

    names: List[str] = Field(
        ...,
        description="All the person, organization, or business entities that "
        "appear in the text",
    )

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are extracting organization and person entities from the text.",),
        ("human","Use the given format to extract information from the following input: {question}",),
    ]
)

entity_chain = prompt | llm.with_structured_output(Entities)

  warn_beta(


In [18]:
entity_chain.invoke({"question": "India is located in south asia?"}).names

['India', 'south asia']

# 9.Full Knowledge Graph

Great, now that we can detect entities in the question, let's use a full-text index to map them to the knowledge graph. First, we need to define a full-text index and a function that will generate full-text queries that allow a bit of misspelling, which we won't go into much detail here.

In [21]:
def generate_full_text_query(input: str) -> str:
    """
    Generate a full-text search query for a given input string.

    This function constructs a query string suitable for a full-text search.
    It processes the input string by splitting it into words and appending a
    similarity threshold (~2 changed characters) to each word, then combines
    them using the AND operator. Useful for mapping entities from user questions
    to database values, and allows for some misspelings.
    """
    full_text_query = ""
    words = [el for el in remove_lucene_chars(input).split() if el]
    for word in words[:-1]:
        full_text_query += f" {word}~2 AND"
    full_text_query += f" {words[-1]}~2"
    return full_text_query.strip()

# Fulltext index query
def structured_retriever(question: str) -> str:
    """
    Collects the neighborhood of entities mentioned
    in the question
    """
    result = ""
    entities = entity_chain.invoke({"question": question})
    for entity in entities.names:
        response = graph.query(
            """CALL db.index.fulltext.queryNodes('entity', $query, {limit:2})
            YIELD node,score
            CALL {
              MATCH (node)-[r:!MENTIONS]->(neighbor)
              RETURN node.id + ' - ' + type(r) + ' -> ' + neighbor.id AS output
              UNION
              MATCH (node)<-[r:!MENTIONS]-(neighbor)
              RETURN neighbor.id + ' - ' + type(r) + ' -> ' +  node.id AS output
            }
            RETURN output LIMIT 50
            """,
            {"query": generate_full_text_query(entity)},
        )
        result += "\n".join([el['output'] for el in response])
    return result

In [24]:
print(structured_retriever("India is located in south asia?"))

Large Language Model - TASK -> General-Purpose Language Generation
Large Language Model - TASK -> Natural Language Processing
Large Language Model - TASK -> Classification
Large Language Model - MODEL -> Artificial Neural Networks
Large Language Model - MODEL -> Decoder-Only Transformer-Based Architecture
Large Language Model - MODEL -> Recurrent Neural Network Variants
Large Language Model - MODEL -> Mamba
Large Language Model - METHOD -> Fine Tuning
Large Language Model - MODEL -> Gpt-3
Large Language Model - COMPONENT -> Syntax
Large Language Model - COMPONENT -> Semantics
Large Language Model - COMPONENT -> Ontology
Large Language Model - COMPONENT -> Inaccuracies
Large Language Model - COMPONENT -> Biases
Large Language Model - COMPANY -> Openai
Large Language Model - MODEL -> Chatgpt
Large Language Model - MODEL -> Microsoft Copilot
Large Language Model - COMPANY -> Google
Large Language Model - MODEL -> Palm
Large Language Model - MODEL -> Gemini
Neurips Conference - CONFERENCE 

# 10.Final retriever
As we mentioned at the start, we'll combine the unstructured and graph retriever to create the final context that will be passed to an LLM.

In [26]:
def retriever(question: str):
    print(f"Search query: {question}")
    structured_data = structured_retriever(question)
    unstructured_data = [el.page_content for el in vector_index.similarity_search(question)]
    final_data = f"""Structured data:
    {structured_data}
    Unstructured data:
    {"#Document ". join(unstructured_data)}
    """
    return final_data

As we are dealing with Python, we can simply concatenate the outputs using the f-string.
# 11.Defining the RAG chain
We have successfully implemented the retrieval component of the RAG. First, we will introduce the query rewriting part that allows conversational follow up questions.


In [27]:
# Condense a chat history and follow-up question into a standalone question
_template = """Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question,
in its original language.
Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:"""  # noqa: E501
CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(_template)

def _format_chat_history(chat_history: List[Tuple[str, str]]) -> List:
    buffer = []
    for human, ai in chat_history:
        buffer.append(HumanMessage(content=human))
        buffer.append(AIMessage(content=ai))
    return buffer

_search_query = RunnableBranch(
    # If input includes chat_history, we condense it with the follow-up question
    (
        RunnableLambda(lambda x: bool(x.get("chat_history"))).with_config(
            run_name="HasChatHistoryCheck"
        ),  # Condense follow-up question and chat into a standalone_question
        RunnablePassthrough.assign(
            chat_history=lambda x: _format_chat_history(x["chat_history"])
        )
        | CONDENSE_QUESTION_PROMPT
        | ChatOpenAI(temperature=0)
        | StrOutputParser(),
    ),
    # Else, we have no chat history, so just pass through the question
    RunnableLambda(lambda x : x["question"]),
)

In [28]:
template = """Answer the question based only on the following context:
{context}

Question: {question}
Use natural language and be concise.
Answer:"""
prompt = ChatPromptTemplate.from_template(template)

chain = (
    RunnableParallel(
        {
            "context": _search_query | retriever,
            "question": RunnablePassthrough(),
        }
    )
    | prompt
    | llm
    | StrOutputParser()
)

In [29]:
chain.invoke({"question": "What is large language model?"})

Search query: What is large language model?


'A large language model is an artificial neural network used for general-purpose language generation, natural language processing, and classification tasks. It learns statistical relationships from text documents during training to generate text and understand syntax, semantics, ontology, inaccuracies, and biases in human language.'

In [30]:
chain.invoke(
    {
        "question": "What is the name of first large language model?",
        "chat_history": [("What is large language model?", "A large language model is an artificial neural network used for general-purpose language generation, natural language processing, and classification tasks. It learns statistical relationships from text documents during training to generate text and understand syntax, semantics, ontology, inaccuracies, and biases in human language.")],
    }
)

Search query: What is the name of the first large language model?


'The name of the first large language model is GPT-1.'