# Using SPARQL with Knowledge Graphs for RAG

## Install Required Libraries

In [1]:
!pip install -q rdflib==7.0.0 langchain==0.3.0 langchain-community==0.3.0 langchain-openai==0.2.0 langchain-chroma==0.1.4 chromadb==0.5.3




## Define Initial Question

In [2]:
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from typing import List

_ = load_dotenv()

In [3]:
question = "Give me the github link of Knowledge Graph Structure as Prompt"
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
query_embedding_vector = embeddings.embed_query(question) 

## Load RDF Graph
Load the RDF graph from a Turtle file to initialize the graph structure.

In [4]:
from rdflib import Graph
from rdflib.plugins.sparql import prepareQuery
import pandas as pd

# Initialize RDF Graph
rdf_graph = Graph()

# Load the RDF graph from a fixed file (only done once)
rdf_graph.parse("global_graph.ttl", format="turtle")
print("RDF graph loaded successfully.")

RDF graph loaded successfully.


## SPARQL Query Execution Function
This function executes a SPARQL query against the pre-loaded RDF graph and returns the results as a pandas DataFrame. If the query fails, it prints the error and returns an empty DataFrame.


In [5]:
def sparql_query(query: str) -> pd.DataFrame:
    """
    Executes a SPARQL query on a pre-loaded RDF graph and returns the results as a DataFrame.

    The function dynamically infers the column names from the query results.

    Parameters:
        query (str): The SPARQL query to execute.

    Returns:
        pd.DataFrame: A DataFrame containing the results of the SPARQL query.
    """
    try:
        # Prepare and execute the query        
        query = prepareQuery(query)
        results = rdf_graph.query(query)
        
        # Extract variable (column) names from the query result
        columns = results.vars  # Get the variable names from the query results
        
        # Process the results and convert them into a list of dictionaries
        data = []
        for row in results:
            row_data = {str(var): row[var] for var in columns}  # Dynamically build a row dict
            data.append(row_data)
        
        # Convert the data into a DataFrame
        df = pd.DataFrame(data, columns=[str(var) for var in columns])
        return df

    except Exception as e:
        print(f"An error occurred while executing the SPARQL query: {e}")
        return pd.DataFrame()

## Initialize vector store
Create a Chroma collection to store entities and their embeddings.

In [6]:
from langchain_chroma import Chroma
import chromadb
from tqdm import tqdm

# Initialize Chroma Persistent Client
persistent_client = chromadb.PersistentClient()
collection = persistent_client.get_or_create_collection(name="graphrag_collection")

# Execute the SPARQL query and get results as a DataFrame
query = """
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX onto: <http://example.org/ontology#>
    
    SELECT ?entity ?name ?type ?description ?description_embedding
    WHERE {
        ?entity rdf:type onto:Entity .
        ?entity onto:hasName ?name .
        ?entity onto:hasType ?type .
        ?entity onto:hasDescription ?description .
        ?entity onto:hasDescriptionEmbedding ?description_embedding .
    }
"""
entities_df = sparql_query(query)

# Prepare data for Chroma collection
# The vectors are collected in `embedding_vectors` so that the OpenAIEmbeddings
# client named `embeddings` (defined earlier) is not overwritten
documents, metadatas, embedding_vectors, ids = [], [], [], []

# Using tqdm for progress tracking
print("Processing entities and adding to Chroma collection...")
for _, row in tqdm(entities_df.iterrows(), total=len(entities_df), desc="Entities Processed"):
    documents.append(f"Name: {row['name']}, Type: {row['type']}, Description: {row['description']}")
    metadatas.append({
        'subject': str(row['entity']),
        'name': str(row['name']),
        'type': str(row['type']),
        'description': str(row['description'])
    })
    # The embedding is stored in the graph as a whitespace-separated string of floats
    embedding_vectors.append([float(x) for x in row['description_embedding'].split()])
    ids.append(str(row['entity']))

# Add the processed data to the Chroma collection
collection.add(
    documents=documents,
    metadatas=metadatas,
    embeddings=embedding_vectors,
    ids=ids
)
print("Chroma collection populated successfully.")

# Initialize vector store using Chroma
vector_store = Chroma(client=persistent_client, collection_name="graphrag_collection")

# Verify the count of entries in the collection
print(f"Total entries in the collection: {vector_store._collection.count()}")


Processing entities and adding to Chroma collection...


Entities Processed: 100%|██████████| 1140/1140 [00:02<00:00, 500.25it/s]
Add of existing embedding ID: http://example.org/data#Entity_007dbb8cddf248268127b52e600ada04
Add of existing embedding ID: http://example.org/data#Entity_008dce6bd1634d71bec32473e6e20afc
Add of existing embedding ID: http://example.org/data#Entity_00c41e15396746689f430e63db124072
Add of existing embedding ID: http://example.org/data#Entity_00d061cdff98409e966eff25bdde6468
Add of existing embedding ID: http://example.org/data#Entity_0127aafb317e4c7aa0ab2994c1ff0d47
Add of existing embedding ID: http://example.org/data#Entity_02846d01807c41628b8765a255ccc5e7
Add of existing embedding ID: http://example.org/data#Entity_045e3b05d6cf4a599fa85caeb7893c7e
Add of existing embedding ID: http://example.org/data#Entity_047fd5713ac5485fa2ac0be44c9f7e2c
Add of existing embedding ID: http://example.org/data#Entity_04b59bb8f41d4184952e841543f4c0e4
Add of existing embedding ID: http://example.org/data#Entity_058fa0a8f4f54b48ba00

Chroma collection populated successfully.
Total entries in the collection: 3725
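
The loop above assumes that every `onto:hasDescriptionEmbedding` literal is a whitespace-separated list of floats. A minimal sketch of a stricter parser follows. The helper name `parse_embedding` is hypothetical, and the expected dimension of 1536 is an assumption based on the default output size of `text-embedding-3-small`; adjust it if the graph was built with a different size.

```python
from typing import List

EXPECTED_DIM = 1536  # assumption: default vector size of text-embedding-3-small


def parse_embedding(literal: str, expected_dim: int = EXPECTED_DIM) -> List[float]:
    """Parse a whitespace-separated embedding literal and check its length."""
    values = [float(token) for token in str(literal).split()]
    if len(values) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(values)}")
    return values


# Tiny demonstration with a 3-dimensional vector
print(parse_embedding("0.1 -0.25 3e-2", expected_dim=3))
```

If the `Add of existing embedding ID` warnings come from re-running the cell against the persistent collection, Chroma's `collection.upsert` takes the same `ids`, `embeddings`, `metadatas`, and `documents` arguments and may be a better fit than `add`.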


## Local Search
### Entity-based Reasoning
The local search method uses both structured data from a knowledge graph and unstructured information from input documents to enhance the LLM's context during query processing. This method is particularly effective for questions involving detailed knowledge about specific entities, such as 'What are the healing properties of chamomile?'

Define the search limits for entities, chunks, communities, and relationships.

In [7]:
# Adapted from https://github.com/ianormy/msft_graphrag_blog/blob/main/Part3/create_local_global.ipynb

TOP_ENTITIES = 10
TOP_CHUNKS = 10
TOP_COMMUNITIES = 3
TOP_OUTGOING_RELATIONSHIPS = 10
TOP_INCOMING_RELATIONSHIPS = 10

### Find the top 10 nearest Entities
Perform a k-nearest neighbor search using the vector store to identify the 10 most similar `Entity` instances based on the query embedding vector.

In [8]:
results = vector_store.similarity_search_by_vector(
    embedding=query_embedding_vector, k=TOP_ENTITIES
)

entity_list = [doc.metadata['subject'] for doc in results]
entity_list

['http://example.org/data#Entity_fbdd7a2514b346c4ba3d23f63084db3b',
 'http://example.org/data#Entity_f51482e3e86145a69bed79db525de5dd',
 'http://example.org/data#Entity_434f100addea47fc9e4f6d7cadba6374',
 'http://example.org/data#Entity_e0e9bca7e9694a92b35c435495aafa3d',
 'http://example.org/data#Entity_8a66c7d217004b9ab2d18da75a395ba7',
 'http://example.org/data#Entity_f3ed8363805f4b9e87c09bf5c737a478',
 'http://example.org/data#Entity_157bb3947232453cbe97799a093e4e12',
 'http://example.org/data#Entity_34fbaf58d70c4da0a622cfe0e535e95c',
 'http://example.org/data#Entity_d19ee7f7b13f42a48d7e13c7e0c79229',
 'http://example.org/data#Entity_5e991d98755446ec86ad8c2dfc689293']
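
To make the nearest-neighbour step concrete, here is a brute-force sketch in plain Python. It is not how the vector store works internally: the store uses its own index, and the distance metric depends on how the collection is configured. The use of cosine similarity, the function names, and the toy vectors are all assumptions for illustration.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: List[float], vectors: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(entity_id, cosine_similarity(query, vec)) for entity_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 2-dimensional embeddings keyed by entity id
toy_vectors = {
    "ex:Entity_a": [1.0, 0.0],
    "ex:Entity_b": [0.0, 1.0],
    "ex:Entity_c": [0.7, 0.7],
}
print([entity_id for entity_id, _ in top_k([1.0, 0.1], toy_vectors, k=2)])
```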

In [9]:
def get_entities_query(entity_ids: List[str]) -> str:
    """
    Constructs a SPARQL query that retrieves entities based on a list of entity IDs.

    Parameters:
        entity_ids (list of str): A list of entity URIs to match.

    Returns:
        str: A SPARQL query string.
    """
    # Convert entity URIs into the SPARQL VALUES format
    values_clause = " ".join([f"<{entity_id}>" for entity_id in entity_ids])
    
    query = f"""
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX onto: <http://example.org/ontology#>
    
    SELECT ?human_readable_id ?name ?type ?description
    WHERE {{
        ?entity rdf:type onto:Entity .
        ?entity onto:hasName ?name .
        ?entity onto:hasType ?type .
        ?entity onto:hasDescription ?description .        
        ?entity onto:hasHumanReadableId ?human_readable_id
                
        VALUES ?entity {{
            {values_clause}
        }}
    }}
    """
    
    return query

entities_df = sparql_query(get_entities_query(entity_list))
entities_df


| | human_readable_id | name | type | description |
|---|---|---|---|---|
| 0 | 370 | KG | DATA STRUCTURE, CONCEPT | Knowledge Graphs (KGs) are data structures tha... |
| 1 | 62 | KG STRUCTURE | CONCEPT, DATA STRUCTURE | A knowledge graph structure used to represent ... |
| 2 | 70 | GRAPH CONTEXT C | DATA STRUCTURE, CONTEXT | Context generated from knowledge graph structu... |
| 3 | 4 | KG STRUCTURE AS PROMPT | TECHNIQUE, METHOD | KG STRUCTURE AS PROMPT" is a novel approach de... |
| 4 | 183 | KNOWLEDGE GRAPHS | DATA STRUCTURE, CONCEPT | Knowledge Graphs are graph-based structures us... |
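
The query builders in this notebook interpolate entity URIs directly into the `VALUES` clause. That works for URIs that come from the graph itself, but a value containing `>` or `}` would change the structure of the query. A minimal sketch of a guarded builder follows. The name `build_values_clause` is hypothetical, and the character check follows the SPARQL grammar for IRI references as I understand it.

```python
import re
from typing import List

# Characters that may not appear inside a SPARQL IRI reference <...>:
# control characters and space, plus < > " { } | ^ ` and backslash
_FORBIDDEN_IN_IRI = re.compile(r'[\x00-\x20<>"{}|^`\\]')


def build_values_clause(entity_ids: List[str]) -> str:
    """Return '<iri1> <iri2> ...' after rejecting strings that are not safe IRIs."""
    for entity_id in entity_ids:
        if _FORBIDDEN_IN_IRI.search(entity_id):
            raise ValueError(f"not a valid IRI for a SPARQL query: {entity_id!r}")
    return " ".join(f"<{entity_id}>" for entity_id in entity_ids)


print(build_values_clause([
    "http://example.org/data#Entity_1",
    "http://example.org/data#Entity_2",
]))
```

The returned string could take the place of the inline `" ".join(...)` expression in each query-building function.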


In [10]:
def dataframe_to_text(df: pd.DataFrame, column_delimiter: str = "|", context_name: str = "") -> str:
    """
    Converts a pandas DataFrame into a formatted text string representation, with an optional context header.

    Parameters:
        df (pd.DataFrame): The DataFrame to be converted.
        column_delimiter (str): The delimiter used to separate columns in the text output. Defaults to '|'.
        context_name (str): An optional context header to include at the top of the text output.

    Returns:
        str: A string representation of the DataFrame, including headers and rows, separated by the specified delimiter.
    """
    if df.empty:
        return ""

    header_text = f"-----{context_name}-----\n" if context_name else ""
    header_text += column_delimiter.join(df.columns) + "\n"
    rows_text = "\n".join(column_delimiter.join(map(str, row)) for row in df.values)

    return header_text + rows_text

In [11]:
entity_text = dataframe_to_text(entities_df, context_name='Entities')
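
`dataframe_to_text` writes each value as-is, so a value that contains a line break or the `|` delimiter would break the row layout of the context table. A dependency-free sketch of one way to flatten values first is shown below. The helpers `clean_cell` and `rows_to_text` are hypothetical, work on plain rows rather than a DataFrame, and replacing the delimiter with `/` is an arbitrary choice.

```python
from typing import Iterable, List


def clean_cell(value: object, column_delimiter: str = "|") -> str:
    """Flatten a cell to a single line and remove the column delimiter from it."""
    text = " ".join(str(value).split())  # collapses \r\n, tabs and repeated spaces
    return text.replace(column_delimiter, "/")


def rows_to_text(columns: List[str], rows: Iterable[Iterable[object]], column_delimiter: str = "|") -> str:
    """Render a header line followed by one delimiter-separated line per row."""
    lines = [column_delimiter.join(columns)]
    for row in rows:
        lines.append(column_delimiter.join(clean_cell(v, column_delimiter) for v in row))
    return "\n".join(lines)


print(rows_to_text(["id", "text"], [[1, "first line\r\nsecond | line"]]))
```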

### Get the Top 10 Chunks (TextUnits)
Retrieve the `TextUnit` records associated with the specified `Entity` instances. Rank these `TextUnit` records based on their frequency of association with the entities, and return the top 10 results.

In [12]:
def get_textunits_by_entities(entity_ids: List[str], limit_chunks: int = 3) -> str:
    """
    Constructs a SPARQL query that retrieves the top TextUnits connected to specified Entity records.

    Parameters:
        entity_ids (list of str): A list of entity IDs.
        limit_chunks (int): The maximum number of TextUnits to return.

    Returns:
        str: A SPARQL query string.
    """
    # Convert entity URIs into the SPARQL VALUES format
    values_clause = " ".join([f"<{entity_id}>" for entity_id in entity_ids])
    
    query = f"""
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX onto: <http://example.org/ontology#>
    PREFIX ex: <http://example.org/data#>

    SELECT ?textunit ?text (COUNT(?entity_uri) AS ?freq)
    WHERE {{
        
        VALUES ?entity_uri {{
            {values_clause}
        }}

        ?entity_uri rdf:type onto:Entity .

        ?textunit rdf:type onto:TextUnit ;
                  onto:hasText ?text ;
                  onto:referencesEntity ?entity_uri .
    }}
    GROUP BY ?textunit ?text
    ORDER BY DESC(?freq)
    LIMIT {limit_chunks}
    """
    
    return query

In [13]:
chunks_df = sparql_query(get_textunits_by_entities(entity_list, TOP_CHUNKS))
chunks_df

| | textunit | text | freq |
|---|---|---|---|
| 0 | http://example.org/data#TextUnit_5102919b3ad27... | # Knowledge Graph Structure as Prompt: Improvi... | 1 |
| 1 | http://example.org/data#TextUnit_056de216afa6e... | and an open-\r\n---\r\nKG Structure as Prompt... | 1 |
| 2 | http://example.org/data#TextUnit_d3aa4f9c84bab... | content\r\nselection of the KG structures in ... | 1 |
| 3 | http://example.org/data#TextUnit_2f64be1ed9bb8... | We think the reason is\r\nthat ordered candid... | 1 |
| 4 | http://example.org/data#TextUnit_001576d5a1400... | the pair x and y as\r\ndefined in Eq. 3. Agai... | 1 |
| 5 | http://example.org/data#TextUnit_aeec53dfc2004... | softmax layer are further applied on top of t... | 1 |
| 6 | http://example.org/data#TextUnit_1a5bd10d2980f... | .\r\n3. We develop an evaluation framework and... | 1 |
| 7 | http://example.org/data#TextUnit_7c898119afbe1... | domain-specific uses. On the one hand, it is ... | 1 |
| 8 | http://example.org/data#TextUnit_7f3a9f7fadb58... | and Linguistic Computing 13(4), 177–186 (12 1... | 1 |
| 9 | http://example.org/data#TextUnit_9afd69f95819e... | knowledge graphs. In: Proc. of AAAI. vol. 35,... | 1 |
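
The `GROUP BY` / `COUNT` / `ORDER BY DESC` / `LIMIT` combination in the query ranks each TextUnit by how many of the selected entities it references. The same ranking can be sketched in plain Python. The function name and the toy data are hypothetical, and the sketch assumes each TextUnit references an entity at most once.

```python
from collections import Counter
from typing import Dict, List, Tuple


def rank_textunits(references: Dict[str, List[str]], entity_ids: List[str], limit: int) -> List[Tuple[str, int]]:
    """references maps a TextUnit id to the ids of the entities it references."""
    selected = set(entity_ids)
    freq: Counter = Counter()
    for textunit, entities in references.items():
        hits = len(selected & set(entities))  # selected entities this TextUnit mentions
        if hits:
            freq[textunit] = hits
    return freq.most_common(limit)  # highest counts first, cut off at `limit`


toy_references = {
    "tu1": ["e1", "e2"],
    "tu2": ["e2"],
    "tu3": ["e9"],
}
print(rank_textunits(toy_references, ["e1", "e2"], limit=2))
```

In the output above every `freq` is 1, so in that run the ranking does not separate the chunks and their order is decided by the query engine.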


In [14]:
chunk_text = dataframe_to_text(chunks_df, context_name='Chunks')

### Get the Top 3 Communities
Get the top 3 `Community` records that are related to these `Entity` records, including their community reports.

In [15]:
def get_communities_with_reports_by_entities(entity_ids: List[str], limit_communities: int = 3) -> str:
    """
    Constructs a SPARQL query that retrieves Communities and their associated Community Reports
    connected to specified Entity records, ordered by the rank in the community report.

    Parameters:
        entity_ids (list of str): A list of entity IDs.
        limit_communities (int): The maximum number of Communities to return.

    Returns:
        str: A SPARQL query string.
    """
    # Convert entity URIs into the SPARQL VALUES format
    values_clause = " ".join([f"<{entity_id}>" for entity_id in entity_ids])    
    query = f"""
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX onto: <http://example.org/ontology#>
    PREFIX ex: <http://example.org/data#>

    SELECT DISTINCT ?community ?report_title ?report_summary ?rank
    WHERE {{
        VALUES ?entity_uri {{
            {values_clause}
        }}

        ?entity_uri rdf:type onto:Entity ;
                    onto:isInCommunity ?community .

        ?community rdf:type onto:Community ;
                   onto:hasCommunityReport ?report .
                   
        ?report rdf:type onto:CommunityReport ;
                onto:hasTitle ?report_title ;
                onto:hasSummary ?report_summary ;
                onto:hasRank ?rank .                    
    }}    
    ORDER BY DESC(?rank)
    LIMIT {limit_communities}
    """
    
    return query

In [16]:
reports_df = sparql_query(get_communities_with_reports_by_entities(entity_list, limit_communities=TOP_COMMUNITIES))
reports_df

| | community | report_title | report_summary | rank |
|---|---|---|---|---|
| 0 | http://example.org/data#Community_146 | Knowledge Graphs and Large Language Models in ... | The community of algorithmic analysis is signi... | 9.0 |
| 1 | http://example.org/data#Community_7 | Large Language Models and Their Role in Algori... | The community of Large Language Models (LLMs) ... | 9.0 |
| 2 | http://example.org/data#Community_1 | Algorithmic Analysis and Knowledge Graphs | The community of algorithmic analysis is intri... | 9.0 |


In [17]:
reports_text = dataframe_to_text(reports_df, context_name='Reports')

### Get Incoming and Outgoing Relationships for Entities
Find the incoming and outgoing relationships for the selected `Entity` instances.

In [18]:
def get_outgoing_relationships(entity_ids: List[str], limit_outgoing_relationships: int = 10) -> str:
    """
    Constructs a SPARQL query that retrieves outgoing relationships where the specified entities are the source.

    Parameters:
        entity_ids (list of str): A list of Entity IDs.
        limit_outgoing_relationships (int): The maximum number of relationships to return.

    Returns:
        str: A SPARQL query string.
    """
    # Convert entity URIs into the SPARQL VALUES format
    values_clause = " ".join([f"<{entity_id}>" for entity_id in entity_ids])
    
    query = f"""
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX onto: <http://example.org/ontology#>
    PREFIX ex: <http://example.org/data#>

    SELECT ?human_readable_id ?source_name ?target_name ?description
    WHERE {{
        ?relationship rdf:type onto:Relationship ;
                      onto:hasHumanReadableId ?human_readable_id ;
                      onto:hasDescription ?description ;
                      onto:source ?source_entity ;
                      onto:target ?target_entity ;
                      onto:hasSourceName ?source_name ;
                      onto:hasTargetName ?target_name ;
                      onto:hasRank ?rank ;
                      onto:hasWeight ?weight .

        VALUES ?source_entity {{
                {values_clause}
            }}        
    }}
    ORDER BY DESC(?rank) DESC(?weight)
    LIMIT {limit_outgoing_relationships}
    """
    
    return query

In [19]:
outgoing_relationships_df = sparql_query(get_outgoing_relationships(entity_list, TOP_OUTGOING_RELATIONSHIPS))
outgoing_relationships_df

| | human_readable_id | source_name | target_name | description |
|---|---|---|---|---|
| 0 | 70 | KG STRUCTURE AS PROMPT | KNOWLEDGE GRAPH (KG) | The approach uses structural information from ... |
| 1 | 358 | KNOWLEDGE GRAPHS | LORA | LoRA can be applied to large language models t... |
| 2 | 351 | KNOWLEDGE GRAPHS | HITTER | HittER uses hierarchical transformers for know... |
| 3 | 370 | KNOWLEDGE GRAPHS | EVENT SEQUENCE GENERATION | Event sequence generation is used to replicate... |
| 4 | 369 | KNOWLEDGE GRAPHS | OL (ONTOLOGY LEARNING) | Ontology Learning often involves extracting st... |
| 5 | 360 | KNOWLEDGE GRAPHS | BART | BART can be used for natural language generati... |
| 6 | 353 | KNOWLEDGE GRAPHS | MEM-KGC | MEM-KGC is a model for knowledge graph completion |
| 7 | 354 | KNOWLEDGE GRAPHS | CONVOLUTIONAL 2D KNOWLEDGE GRAPH EMBEDDINGS | KNOWLEDGE GRAPHS are structured representation... |
| 8 | 368 | KNOWLEDGE GRAPHS | DBPEDIA | DBpedia is an example of a knowledge graph, re... |
| 9 | 355 | KNOWLEDGE GRAPHS | COKG-QA | COKG-QA is a question answering system over CO... |


In [20]:
def get_incoming_relationships(entity_ids: List[str], limit_incoming_relationships: int = 10) -> str:
    """
    Constructs a SPARQL query that retrieves incoming relationships where the specified entities are the target.

    Parameters:
        entity_ids (list of str): A list of Entity IDs.
        limit_incoming_relationships (int): The maximum number of relationships to return.

    Returns:
        str: A SPARQL query string.
    """
    # Convert entity URIs into the SPARQL VALUES format
    values_clause = " ".join([f"<{entity_id}>" for entity_id in entity_ids])
    
    query = f"""
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX onto: <http://example.org/ontology#>
    PREFIX ex: <http://example.org/data#>

    SELECT ?human_readable_id ?source_name ?target_name ?description
    WHERE {{
        ?relationship rdf:type onto:Relationship ;          
                      onto:hasHumanReadableId ?human_readable_id ;
                      onto:hasDescription ?description ;
                      onto:source ?source_entity ;
                      onto:target ?target_entity ;
                      onto:hasSourceName ?source_name ;
                      onto:hasTargetName ?target_name ;
                      onto:hasRank ?rank ;
                      onto:hasWeight ?weight .        

        VALUES ?target_entity {{
            {values_clause}
        }}
        
    }}
    ORDER BY DESC(?rank) DESC(?weight)
    LIMIT {limit_incoming_relationships}
    """
    
    return query

In [21]:
incoming_relationships_df = sparql_query(get_incoming_relationships(entity_list, TOP_INCOMING_RELATIONSHIPS))
incoming_relationships_df

| | human_readable_id | source_name | target_name | description |
|---|---|---|---|---|
| 0 | 461 | DIFT | KG | DIFT acquires knowledge from the KG to improve... |
| 1 | 137 | WORDNET | KNOWLEDGE GRAPHS | WordNet can be used as a resource in the const... |
| 2 | 181 | WIKIDATA | KNOWLEDGE GRAPHS | Wikidata is often used as a source for buildin... |
| 3 | 127 | BERT | KNOWLEDGE GRAPHS | BERT can be used for language understanding in... |
| 4 | 285 | METAPATHS | KNOWLEDGE GRAPHS | Metapaths are used within knowledge graphs to ... |
| 5 | 109 | PROMPT-BASED LEARNING | GRAPH CONTEXT C | Graph context C is combined with input sequenc... |
| 6 | 104 | PROMPT-BASED LEARNING | KG STRUCTURE | KG structure is used as a prompt in prompt-bas... |
| 7 | 61 | SMALL LANGUAGE MODELS (SLMS) | KG STRUCTURE AS PROMPT | The "KG Structure as Prompt" approach signific... |
| 8 | 69 | KNOWLEDGE-BASED CAUSAL DISCOVERY | KG STRUCTURE AS PROMPT | The KG Structure as Prompt approach was evalua... |
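
Both relationship queries order by `DESC(?rank) DESC(?weight)` and then apply `LIMIT`: rank decides the order, and weight breaks ties. Neither variable is in the `SELECT` list, which is why they do not appear in the result tables. A plain-Python sketch of the same ordering, using a hypothetical function name and made-up rows:

```python
from typing import Dict, List


def top_relationships(rows: List[Dict], limit: int) -> List[Dict]:
    """Sort by rank, then weight, both descending, and keep the first `limit` rows."""
    ordered = sorted(rows, key=lambda r: (r["rank"], r["weight"]), reverse=True)
    return ordered[:limit]


# Made-up rows for illustration only
toy_rows = [
    {"id": "r1", "rank": 5, "weight": 1.0},
    {"id": "r2", "rank": 9, "weight": 2.0},
    {"id": "r3", "rank": 9, "weight": 7.0},
]
print([row["id"] for row in top_relationships(toy_rows, limit=2)])
```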


In [22]:
# Merge both directions into one table so the header is written once and the
# last outgoing row does not run into the first incoming row
relationships_df = pd.concat(
    [outgoing_relationships_df, incoming_relationships_df], ignore_index=True
)
relationships_text = dataframe_to_text(relationships_df, context_name='Relationships')

### Create LangChain Response
With the entities, text chunks, community reports, and relationships for the selected entities in hand, the next step is to combine them into a single context and pass it to the LLM through a LangChain chain.

In [23]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from tqdm import tqdm

llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0,    
)

### Prompt with no context
Ask the question with no context from the knowledge graph, as a baseline for comparison.

In [24]:
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful assistant that answers questions about academic papers.",
        ),
        ("human", "{input}"),
    ]
)
chain = prompt | llm | StrOutputParser()
chain.invoke(
    {
        "input": question,
    }
)

'The paper "Knowledge Graph Structure as Prompt" does not have a direct GitHub link provided in the information I have. To find the GitHub repository, if it exists, you can try the following steps:\n\n1. **Check the Paper**: Look at the paper itself, often in the introduction or conclusion sections, for any mention of supplementary materials or code repositories.\n\n2. **Search Online**: Use search engines with the paper\'s title along with "GitHub" to see if the authors have shared a link online.\n\n3. **Authors\' Profiles**: Visit the authors\' institutional or personal web pages, as they might host or link to the code there.\n\n4. **Research Repositories**: Websites like Papers with Code often list implementations of papers and might have a link to the GitHub repository if it exists.\n\nIf you have any more specific details or need further assistance, feel free to ask!'

### Prompt with context retrieved from Knowledge Graph
Create a context using the Knowledge Graph and feed that to the LLM.

In [25]:
# Source: https://github.com/microsoft/graphrag/blob/main/graphrag/query/structured_search/local_search/system_prompt.py

LOCAL_SEARCH_SYSTEM_PROMPT = """
---Role---

You are a helpful assistant responding to questions about data in the tables provided.


---Goal---

Generate a response of the target length and format that responds to the user's question, summarizing all information in the input data tables appropriate for the response length and format, and incorporating any relevant general knowledge.

If you don't know the answer, just say so. Do not make anything up.

Points supported by data should list their data references as follows:

"This is an example sentence supported by multiple data references [Data: <dataset name> (record ids); <dataset name> (record ids)]."

Do not list more than 5 record ids in a single reference. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.

For example:

"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Sources (15, 16), Reports (1), Entities (5, 7); Relationships (23); Claims (2, 7, 34, 46, 64, +more)]."

where 15, 16, 1, 5, 7, 23, 2, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record.

Do not include information where the supporting evidence for it is not provided.


---Target response length and format---

{response_type}


---Data tables---

{context_data}


---Goal---

Generate a response of the target length and format that responds to the user's question, summarizing all information in the input data tables appropriate for the response length and format, and incorporating any relevant general knowledge.

If you don't know the answer, just say so. Do not make anything up.

Points supported by data should list their data references as follows:

"This is an example sentence supported by multiple data references [Data: <dataset name> (record ids); <dataset name> (record ids)]."

Do not list more than 5 record ids in a single reference. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.

For example:

"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Sources (15, 16), Reports (1), Entities (5, 7); Relationships (23); Claims (2, 7, 34, 46, 64, +more)]."

where 15, 16, 1, 5, 7, 23, 2, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record.

Do not include information where the supporting evidence for it is not provided.


---Target response length and format---

{response_type}

Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown.
"""

In [26]:
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            LOCAL_SEARCH_SYSTEM_PROMPT,
        ),
        (
            "human",
            "{question}",
        ),
    ]
)

chain = prompt | llm | StrOutputParser()
chain.invoke(
    {
        "response_type": "single paragraph",
        "context_data": entity_text + "\n" + chunk_text + "\n" + relationships_text +"\n" + reports_text,
        "question": question,
    }
)

'The GitHub link for the "Knowledge Graph Structure as Prompt" approach is available at: [https://github.com/littleflow3r/kg-structure-as-prompt](https://github.com/littleflow3r/kg-structure-as-prompt) [Data: Chunks (5102919b3ad27254673948bab8a72ed8)].'
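
The system prompt asks the model to cite its sources in the form `[Data: <dataset name> (record ids)]`. A small parser makes those citations checkable against the retrieved tables. The sketch below is an assumption about how one might do this and is not part of GraphRAG or LangChain; the function name and the regular expressions are hypothetical.

```python
import re
from typing import Dict, List

# One "[Data: ...]" block, and inside it each "Dataset (id, id, ...)" group
_REFERENCE_BLOCK = re.compile(r"\[Data:\s*(.*?)\]", re.DOTALL)
_DATASET_IDS = re.compile(r"([A-Za-z_ ]+?)\s*\(([^)]*)\)")


def extract_data_references(answer: str) -> Dict[str, List[str]]:
    """Map each dataset name to the record ids cited for it in the answer."""
    references: Dict[str, List[str]] = {}
    for block in _REFERENCE_BLOCK.findall(answer):
        for dataset, ids in _DATASET_IDS.findall(block):
            for record_id in ids.split(","):
                record_id = record_id.strip()
                if record_id and record_id != "+more":  # "+more" is a marker, not an id
                    references.setdefault(dataset.strip(), []).append(record_id)
    return references


sample = "The link is available [Data: Chunks (5102919b3ad27254673948bab8a72ed8)]."
print(extract_data_references(sample))
```

For the answer above, the cited chunk id matches the identifier at the end of the first TextUnit URI in the chunk table as far as that truncated URI is shown.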

## Global Search

### Whole Dataset Reasoning
The global search method aggregates information from community reports in the knowledge graph to summarize themes and provide comprehensive answers. It works in a map-reduce manner: first, breaking down reports into smaller parts to generate intermediate responses (map step), and then combining the key points to form a final summary (reduce step). This approach is ideal for queries requiring an overview of the dataset, such as "What are the top 5 themes in the data?"
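
Before the map step can run, the community reports have to be split into batches that fit in a single prompt. A minimal sketch of greedy batching follows. The function name is hypothetical, and budgeting by character count is an assumption made to keep the example dependency-free; a token-based budget would be the more precise choice.

```python
from typing import List


def batch_reports(reports: List[str], max_chars: int) -> List[str]:
    """Greedily pack report texts into newline-joined batches of at most max_chars.

    A single report longer than max_chars still gets a batch of its own.
    """
    batches: List[str] = []
    current: List[str] = []
    current_len = 0
    for report in reports:
        # +1 accounts for the newline that joins two reports in a batch
        added = len(report) + (1 if current else 0)
        if current and current_len + added > max_chars:
            batches.append("\n".join(current))
            current, current_len = [], 0
            added = len(report)
        current.append(report)
        current_len += added
    if current:
        batches.append("\n".join(current))
    return batches


print(batch_reports(["aaaa", "bbbb", "cc"], max_chars=9))
```

Each batch would then be passed as `context_data` to one map call.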

In [27]:
# Source: https://github.com/microsoft/graphrag/blob/main/graphrag/query/structured_search/global_search/map_system_prompt.py

MAP_SYSTEM_PROMPT = """
---Role---

You are a helpful assistant responding to questions about data in the tables provided.


---Goal---

Generate a response consisting of a list of key points that responds to the user's question, summarizing all relevant information in the input data tables.

You should use the data provided in the data tables below as the primary context for generating the response.
If you don't know the answer or if the input data tables do not contain sufficient information to provide an answer, just say so. Do not make anything up.

Each key point in the response should have the following element:
- Description: A comprehensive description of the point.
- Importance Score: An integer score between 0-100 that indicates how important the point is in answering the user's question. An 'I don't know' type of response should have a score of 0.

The response should be JSON formatted as follows:
{{
    "points": [
        {{"description": "Description of point 1 [Data: Reports (report ids)]", "score": score_value}},
        {{"description": "Description of point 2 [Data: Reports (report ids)]", "score": score_value}}
    ]
}}

The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will".

Points supported by data should list the relevant reports as references as follows:
"This is an example sentence supported by data references [Data: Reports (report ids)]"

**Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.

For example:
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 64, 46, 34, +more)]. He is also CEO of company X [Data: Reports (1, 3)]"

where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data report in the provided tables.

Do not include information where the supporting evidence for it is not provided.


---Data tables---

{context_data}

---Goal---

Generate a response consisting of a list of key points that responds to the user's question, summarizing all relevant information in the input data tables.

You should use the data provided in the data tables below as the primary context for generating the response.
If you don't know the answer or if the input data tables do not contain sufficient information to provide an answer, just say so. Do not make anything up.

Each key point in the response should have the following element:
- Description: A comprehensive description of the point.
- Importance Score: An integer score between 0-100 that indicates how important the point is in answering the user's question. An 'I don't know' type of response should have a score of 0.

The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will".

Points supported by data should list the relevant reports as references as follows:
"This is an example sentence supported by data references [Data: Reports (report ids)]"

**Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.

For example:
"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 64, 46, 34, +more)]. He is also CEO of company X [Data: Reports (1, 3)]"

where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data report in the provided tables.

Do not include information where the supporting evidence for it is not provided.

The response should be JSON formatted as follows:
{{
    "points": [
        {{"description": "Description of point 1 [Data: Reports (report ids)]", "score": score_value}},
        {{"description": "Description of point 2 [Data: Reports (report ids)]", "score": score_value}}
    ]
}}
"""
map_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            MAP_SYSTEM_PROMPT,
        ),
        (
            "human",
            "{question}",
        ),
    ]
)
map_chain = map_prompt | llm | StrOutputParser()

In [28]:
REDUCE_SYSTEM_PROMPT = """
---Role---

You are a helpful assistant responding to questions about a dataset by synthesizing perspectives from multiple analysts.


---Goal---

Generate a response of the target length and format that responds to the user's question, summarize all the reports from multiple analysts who focused on different parts of the dataset.

Note that the analysts' reports provided below are ranked in the **descending order of importance**.

If you don't know the answer or if the provided reports do not contain sufficient information to provide an answer, just say so. Do not make anything up.

The final response should remove all irrelevant information from the analysts' reports and merge the cleaned information into a comprehensive answer that provides explanations of all the key points and implications appropriate for the response length and format.

Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown.

The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will".

The response should also preserve all the data references previously included in the analysts' reports, but do not mention the roles of multiple analysts in the analysis process.

**Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.

For example:

"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 34, 46, 64, +more)]. He is also CEO of company X [Data: Reports (1, 3)]"

where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record.

Do not include information where the supporting evidence for it is not provided.


---Target response length and format---

{response_type}


---Analyst Reports---

{report_data}


---Goal---

Generate a response of the target length and format that responds to the user's question by summarizing all the reports from multiple analysts who focused on different parts of the dataset.

Note that the analysts' reports provided below are ranked in the **descending order of importance**.

If you don't know the answer or if the provided reports do not contain sufficient information to provide an answer, just say so. Do not make anything up.

The final response should remove all irrelevant information from the analysts' reports and merge the cleaned information into a comprehensive answer that provides explanations of all the key points and implications appropriate for the response length and format.

The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will".

The response should also preserve all the data references previously included in the analysts' reports, but do not mention the roles of multiple analysts in the analysis process.

**Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.

For example:

"Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 34, 46, 64, +more)]. He is also CEO of company X [Data: Reports (1, 3)]"

where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record.

Do not include information where the supporting evidence for it is not provided.


---Target response length and format---

{response_type}

Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown.
"""

reduce_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            REDUCE_SYSTEM_PROMPT,
        ),
        (
            "human",
            "{question}",
        ),
    ]
)
reduce_chain = reduce_prompt | llm | StrOutputParser()
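
The reduce prompt tells the model that the analysts' reports are ranked in descending order of importance, so the text passed as `report_data` would need to be sorted by score for that statement to hold. The following is a minimal sketch of such a ranking step, assuming the map responses have already been parsed into lists of `{"description", "score"}` dictionaries. `build_report_data` and the `----Analyst N----` layout are illustrative assumptions, not something the notebook defines.

```python
from typing import Any, Dict, List


def build_report_data(points_per_report: List[List[Dict[str, Any]]]) -> str:
    """Flatten per-report points, rank them by score, and format them as text."""
    all_points = [
        {"analyst": idx, "description": p["description"], "score": p["score"]}
        for idx, points in enumerate(points_per_report)
        for p in points
    ]
    # sorted() is stable, so points with equal scores keep their input order.
    ranked = sorted(all_points, key=lambda p: p["score"], reverse=True)
    return "\n\n".join(
        f"----Analyst {p['analyst']}----\n"
        f"Importance Score: {p['score']}\n"
        f"{p['description']}"
        for p in ranked
    )


example = [
    [{"description": "Minor point [Data: Reports (3)]", "score": 10}],
    [{"description": "Major point [Data: Reports (7)]", "score": 90}],
]
print(build_report_data(example))
```

The resulting string could be supplied as `report_data` in place of the raw list of map responses.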

In [29]:
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm

def global_retriever(query: str, level: int, response_type: str = "multiple paragraphs") -> str:
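    """Answer a question from community reports using a map-reduce pass.

    Each community report at or below ``level`` is scored against the question
    by ``map_chain``; the intermediate results are then merged by ``reduce_chain``.

    :param query: the question to answer
    :param level: the maximum community level to include
    :param response_type: the desired length and format of the final response
    :returns: the final response as a string
    """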
    
    community_query = f"""
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        PREFIX onto: <http://example.org/ontology#>
        PREFIX ex: <http://example.org/data#>

        SELECT ?full_content
        WHERE {{
            ?community rdf:type onto:Community ;
                    onto:hasCommunityReport ?report .
                    
            ?report rdf:type onto:CommunityReport ;
                    onto:hasFullContent ?full_content ;
                    onto:hasLevel ?level .

            FILTER(?level <= {level})   
        }}        
        """    
    reports_df = sparql_query(community_query)
    intermediate_results = []
    
    # Map step: run map_chain on every community report in parallel.

    def process_community(content):
        return map_chain.invoke({"question": query, "context_data": content})

    with ThreadPoolExecutor() as executor:
        futures = [
            executor.submit(process_community, reports_df["full_content"].iloc[i])
            for i in range(len(reports_df))
        ]
        # as_completed yields futures in completion order, not submission order.
        for future in tqdm(as_completed(futures), total=len(futures), desc="Processing communities"):
            intermediate_results.append(future.result())

    final_response = reduce_chain.invoke(
        {
            "report_data": intermediate_results,
            "question": query,
            "response_type": response_type,
        }
    )
    return final_response
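
Because `as_completed` yields futures as they finish, the order of `intermediate_results` can differ from one run to the next. If the input order matters, `ThreadPoolExecutor.map` is one alternative: it returns results in the order of its inputs regardless of which task finishes first. The sketch below illustrates this with a stand-in function; `slow_echo` and its delays are made up for the demonstration and replace the real `map_chain.invoke` call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_echo(item):
    """Stand-in for an LLM call: sleep for `delay` seconds, then return `label`."""
    delay, label = item
    time.sleep(delay)
    return label


# The first task is the slowest, so it finishes last.
items = [(0.05, "first"), (0.01, "second"), (0.0, "third")]

with ThreadPoolExecutor(max_workers=3) as executor:
    # executor.map preserves input order even though tasks run concurrently.
    ordered = list(executor.map(slow_echo, items))

print(ordered)  # ['first', 'second', 'third']
```

The trade-off is that `executor.map` only hands back a result once all earlier results are available, so a progress bar wrapped around it advances less smoothly than one driven by `as_completed`.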

In [30]:
print(global_retriever("What are the main themes present in all these papers?", 0))

Processing communities: 100%|██████████| 25/25 [00:30<00:00,  1.23s/it]


The main themes present in these papers revolve around advanced methodologies in algorithmic analysis, machine learning, and natural language processing. These themes are crucial for advancing the field and are prominently featured in conferences like ICML and EMNLP [Data: Reports (605, 847, 845, 846, 393, 543, 691, 692)].

### Algorithmic Analysis and Machine Learning

A significant focus is on complex embeddings and subgraph reasoning, which are essential for tasks like link prediction and inductive relation prediction. These methodologies improve data retrieval and processing, enhancing the understanding of complex data structures [Data: Reports (605, 847, 845, 846, 650, 865)]. The integration of knowledge graphs in causal discovery and the use of models like GraIL and CoMPILE for inductive reasoning are also highlighted, showcasing the importance of structured data representations in algorithmic research [Data: Reports (511, 522, 523, 572, 571, 578, 577, 576)].

### Natural Languag