In [4]:
import os

import pandas as pd
import tiktoken

from graphrag.query.context_builder.entity_extraction import EntityVectorStoreKey
from graphrag.query.indexer_adapters import (
    read_indexer_covariates,
    read_indexer_entities,
    read_indexer_relationships,
    read_indexer_reports,
    read_indexer_text_units,
)
from graphrag.query.question_gen.local_gen import LocalQuestionGen
from graphrag.query.structured_search.local_search.mixed_context import (
    LocalSearchMixedContext,
)
from graphrag.query.structured_search.local_search.search import LocalSearch
from graphrag.vector_stores.lancedb import LanceDBVectorStore

In [5]:
INPUT_DIR = "./output"
LANCEDB_URI = f"{INPUT_DIR}/lancedb"

COMMUNITY_REPORT_TABLE = "community_reports"
ENTITY_TABLE = "entities"
COMMUNITY_TABLE = "communities"
RELATIONSHIP_TABLE = "relationships"
# COVARIATE_TABLE = "covariates"
TEXT_UNIT_TABLE = "text_units"
COMMUNITY_LEVEL = 2
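
Before loading anything, it can help to confirm that the indexing run actually produced the tables named above. The following is a minimal sketch, assuming the output directory holds one `<table>.parquet` file per table (which is how the later cells read them). `missing_tables` is a hypothetical helper written for this note, not part of graphrag.

```python
from pathlib import Path


def missing_tables(input_dir: str, table_names: list[str]) -> list[str]:
    """Return the table names that have no `<name>.parquet` file in input_dir."""
    root = Path(input_dir)
    return [name for name in table_names if not (root / f"{name}.parquet").is_file()]


# Table names mirror the constants defined in the cell above.
required = ["community_reports", "entities", "communities", "relationships", "text_units"]
absent = missing_tables("./output", required)
if absent:
    print(f"Missing parquet files: {absent}")
```

An empty result means every table read in the following cells is present on disk.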

In [6]:
# read the entities and communities tables to get community and degree data
entity_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_TABLE}.parquet")
community_df = pd.read_parquet(f"{INPUT_DIR}/{COMMUNITY_TABLE}.parquet")

entities = read_indexer_entities(entity_df, community_df, COMMUNITY_LEVEL)

# connect to the LanceDB vector store that holds the entity description embeddings
# to connect to a remote db, specify url and port values.
description_embedding_store = LanceDBVectorStore(
    collection_name="default-entity-description",
)
description_embedding_store.connect(db_uri=LANCEDB_URI)

print(f"Entity count: {len(entity_df)}")
entity_df.head()

Entity count: 2687


Unnamed: 0,id,human_readable_id,title,type,description,text_unit_ids,frequency,degree,x,y
0,60a252ec-ecb4-466a-8e68-59768d761dd9,0,MICROSOFT FABRIC,PRODUCT EXPERIENCE,Microsoft Fabric is a comprehensive and unifie...,[422c343a682e7f78dae36ac94f2bf135ec09bd2784744...,22,180,0.0,0.0
1,44dc0a8c-b197-4353-9e22-f047f8d3ba50,1,ONE LAKE,DATA WAREHOUSE,One Lake is a comprehensive data storage and m...,[422c343a682e7f78dae36ac94f2bf135ec09bd2784744...,17,45,0.0,0.0
2,43f0f521-f8f8-42b1-919c-2f8866b36b4b,2,DATA ENGINEERING,ACTIVITY,Data Engineering encompasses a range of proces...,[422c343a682e7f78dae36ac94f2bf135ec09bd2784744...,3,32,0.0,0.0
3,eda3d13b-8f57-4625-8022-eaab1df9465b,3,DATA FACTORY,PRODUCT EXPERIENCE,Data Factory is a comprehensive product experi...,[422c343a682e7f78dae36ac94f2bf135ec09bd2784744...,37,125,0.0,0.0
4,6532609d-8690-4a56-b5a8-215a20cc6ef9,4,POWER BI,PRODUCT EXPERIENCE,Power BI is a comprehensive business analytics...,[422c343a682e7f78dae36ac94f2bf135ec09bd2784744...,58,77,0.0,0.0


In [7]:
relationship_df = pd.read_parquet(f"{INPUT_DIR}/{RELATIONSHIP_TABLE}.parquet")
relationships = read_indexer_relationships(relationship_df)

print(f"Relationship count: {len(relationship_df)}")
relationship_df.head()

Relationship count: 3595


Unnamed: 0,id,human_readable_id,source,target,description,weight,combined_degree,text_unit_ids
0,4f779922-fdef-463c-b7c2-ccef6e6424e5,0,MICROSOFT FABRIC,ONE LAKE,Microsoft Fabric and OneLake are intricately c...,44.0,225,[422c343a682e7f78dae36ac94f2bf135ec09bd2784744...
1,4b67e599-d081-4673-b3a7-550e2d0bce4d,1,MICROSOFT FABRIC,DATA FACTORY,Microsoft Fabric is a comprehensive platform t...,46.0,305,[422c343a682e7f78dae36ac94f2bf135ec09bd2784744...
2,f36f2600-4e9c-4c45-b502-01afd384b058,2,MICROSOFT FABRIC,POWER BI,Microsoft Fabric and Power BI are closely inte...,31.0,257,[422c343a682e7f78dae36ac94f2bf135ec09bd2784744...
3,778b586d-4885-4a2d-929c-4ef69da73dae,3,MICROSOFT FABRIC,REAL-TIME INTELLIGENCE,Microsoft Fabric incorporates a feature known ...,17.0,215,[422c343a682e7f78dae36ac94f2bf135ec09bd2784744...
4,a8a6b5df-5ee7-4b09-9e3a-482510228e8e,4,MICROSOFT FABRIC,AZURE AI FOUNDRY,Azure AI Foundry provides AI capabilities that...,8.0,182,[422c343a682e7f78dae36ac94f2bf135ec09bd2784744...


In [8]:
# NOTE: covariates are turned off by default, because they generally need prompt tuning to be valuable
# Please see the GRAPHRAG_CLAIM_* settings
# covariate_df = pd.read_parquet(f"{INPUT_DIR}/{COVARIATE_TABLE}.parquet")

# claims = read_indexer_covariates(covariate_df)

# print(f"Claim records: {len(claims)}")
# covariates = {"claims": claims}
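
Rather than commenting the covariate lines in and out by hand, the table could be loaded only when it exists. This is a sketch under the assumption that a covariates run writes `covariates.parquet` next to the other tables. `load_if_present` is a hypothetical helper, not a graphrag function.

```python
from pathlib import Path
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def load_if_present(path: str, reader: Callable[[str], T]) -> Optional[T]:
    """Call reader(path) when the file exists; otherwise return None."""
    return reader(path) if Path(path).is_file() else None
```

In this notebook the call would look like `covariate_df = load_if_present(f"{INPUT_DIR}/covariates.parquet", pd.read_parquet)`, followed by building `covariates = {"claims": read_indexer_covariates(covariate_df)}` only when the result is not `None`.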

In [9]:
report_df = pd.read_parquet(f"{INPUT_DIR}/{COMMUNITY_REPORT_TABLE}.parquet")
reports = read_indexer_reports(report_df, community_df, COMMUNITY_LEVEL)

print(f"Report records: {len(report_df)}")
report_df.head()

Report records: 458


Unnamed: 0,id,human_readable_id,community,level,parent,children,title,summary,full_content,rank,rating_explanation,findings,full_content_json,period,size
0,ba053084dcaa4945babebf54486fefb6,456,456,4,444,[],Data Factory and Microsoft Fabric Community,"The community centers around Data Factory, a c...",# Data Factory and Microsoft Fabric Community\...,9.0,The impact severity rating is high due to the ...,[{'explanation': 'Data Factory serves as the c...,"{\n ""title"": ""Data Factory and Microsoft Fa...",2025-05-22,49
1,01f3b549e43a407daa72a221f7b05e9a,457,457,4,444,[],On-Premises Data Gateway and Local Data Source,The community centers around the On-Premises D...,# On-Premises Data Gateway and Local Data Sour...,8.5,The impact severity rating is high due to the ...,[{'explanation': 'The On-Premises Data Gateway...,"{\n ""title"": ""On-Premises Data Gateway and ...",2025-05-22,2
2,a5ced721949e42b8b480394caee27ce1,436,436,3,255,[],Microsoft Fabric Community Dynamics,The Microsoft Fabric community is structured a...,# Microsoft Fabric Community Dynamics\n\nThe M...,9.0,The high impact severity rating reflects the c...,[{'explanation': 'The Activity Log is a crucia...,"{\n ""title"": ""Microsoft Fabric Community Dy...",2025-05-22,32
3,cf67628e7da24816b4e7a1552516d270,437,437,3,255,[],Fabric Activity Log and API Data,The community centers around the Fabric Activi...,# Fabric Activity Log and API Data\n\nThe comm...,8.5,The impact severity rating is high due to the ...,[{'explanation': 'The Fabric Activity Log serv...,"{\n ""title"": ""Fabric Activity Log and API D...",2025-05-22,2
4,519cc9ef5ae846aea0cd3e18141b7b8d,438,438,3,266,[],Microsoft Fabric Administration Community,The community centers around the administratio...,# Microsoft Fabric Administration Community\n\...,8.5,The impact severity rating is high due to the ...,[{'explanation': 'The Admin Portal serves as t...,"{\n ""title"": ""Microsoft Fabric Administrati...",2025-05-22,4


In [10]:
text_unit_df = pd.read_parquet(f"{INPUT_DIR}/{TEXT_UNIT_TABLE}.parquet")
text_units = read_indexer_text_units(text_unit_df)

print(f"Text unit records: {len(text_unit_df)}")
text_unit_df.head()

Text unit records: 243


Unnamed: 0,id,human_readable_id,text,n_tokens,document_ids,entity_ids,relationship_ids,covariate_ids
0,422c343a682e7f78dae36ac94f2bf135ec09bd2784744d...,1,Tell us about your PDF experience.\nMicrosoft ...,1200,[d5835eb381e6b16dd4dd9c9b74c2bb54d06463d791e93...,"[60a252ec-ecb4-466a-8e68-59768d761dd9, 44dc0a8...","[4f779922-fdef-463c-b7c2-ccef6e6424e5, 4b67e59...",[]
1,7f05149428714a222f2c4fa84b8dc05f8991a7a9266cab...,2,"ises and in the cloud. For more information, s...",1200,[d5835eb381e6b16dd4dd9c9b74c2bb54d06463d791e93...,"[eda3d13b-8f57-4625-8022-eaab1df9465b, 8f00da4...","[4f730153-4a17-48e9-945e-cba4006f3aae, 8e1a1a6...",[]
2,406c2d7deeffa71c9bcd8922a425f6548d58eabd180b0d...,3,"IoT Hub, Azure SQL DB Change Data Capture (CD...",1200,[d5835eb381e6b16dd4dd9c9b74c2bb54d06463d791e93...,"[dd9567a4-0196-4e05-83cc-405418f0c7b1, f502cd2...",,[]
3,03b9935515464b3ba9fa660b98d2b7bd12a24ebe4f6494...,4,"For detailed instructions, see\nMoving your d...",1200,[d5835eb381e6b16dd4dd9c9b74c2bb54d06463d791e93...,"[6532609d-8690-4a56-b5a8-215a20cc6ef9, d73b962...","[be729764-c153-45f3-b8c6-23f08f5f68a3, d1c38ba...",[]
4,0c2447dd1987bf104efc25070271e6f085d4b9c2945d60...,5,", see Canceling, expiring, and closing.\nCance...",1200,[d5835eb381e6b16dd4dd9c9b74c2bb54d06463d791e93...,"[1e5a3679-d5fd-4c83-999b-8cafe1efdb73, 5e61555...","[43457dc8-f162-403c-847b-a52c1f439615, 17c177f...",[]


In [11]:
# Feedback: why does the model type have to be passed to both the config and the model factory?

from dotenv import load_dotenv

from graphrag.config.enums import AuthType, ModelType
from graphrag.config.models.language_model_config import LanguageModelConfig
from graphrag.language_model.manager import ModelManager

# Load environment variables from .env file
load_dotenv()

api_key = os.getenv("GRAPHRAG_API_KEY")
llm_model = os.getenv("GRAPHRAG_LLM_MODEL")
embedding_model = os.getenv("GRAPHRAG_EMBEDDING_MODEL")

chat_config = LanguageModelConfig(
    api_key=api_key,
    auth_type=AuthType.APIKey,
    type=ModelType.AzureOpenAIChat,
    model=llm_model,
    deployment_name=llm_model,
    max_retries=20,
    api_base=os.getenv("GRAPHRAG_API_BASE"),
    api_version="2024-02-15-preview",
)
chat_model = ModelManager().get_or_create_chat_model(
    name="local_search",
    model_type=ModelType.AzureOpenAIChat,
    config=chat_config,
)

# encoding_for_model raises KeyError when tiktoken does not recognize the model name
token_encoder = tiktoken.encoding_for_model(llm_model)

embedding_config = LanguageModelConfig(
    api_key=api_key,
    auth_type=AuthType.APIKey,
    type=ModelType.AzureOpenAIEmbedding,
    model=embedding_model,
    deployment_name=embedding_model,  # the Azure deployment name for embeddings; here it matches the model name
    api_base=os.getenv("GRAPHRAG_API_BASE"),
    api_version="2024-02-15-preview",
)

text_embedder = ModelManager().get_or_create_embedding_model(
    name="local_search_embedding",
    model_type=ModelType.AzureOpenAIEmbedding,
    config=embedding_config,
)
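
`os.getenv` returns `None` for an unset variable, so a missing `.env` entry only surfaces later as an opaque client error. A minimal sketch of failing early instead is shown here. `require_env` is a hypothetical helper, and the variable names are the ones read in the cell above.

```python
import os
from typing import Mapping, Optional


def require_env(names: list[str], env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Return the requested variables, raising if any is unset or empty."""
    source = os.environ if env is None else env
    missing = [name for name in names if not source.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: source[name] for name in names}


# Example call (left commented out so the sketch runs without a configured .env):
# settings = require_env(
#     ["GRAPHRAG_API_KEY", "GRAPHRAG_LLM_MODEL", "GRAPHRAG_EMBEDDING_MODEL", "GRAPHRAG_API_BASE"]
# )
```

Calling it right after `load_dotenv()` would report every missing setting at once, before any model is constructed.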

In [12]:
llm_model

'gpt-4o'

In [13]:
context_builder = LocalSearchMixedContext(
    community_reports=reports,
    text_units=text_units,
    entities=entities,
    relationships=relationships,
    # if you did not run covariates during indexing, set this to None
    #covariates=covariates,
    entity_text_embeddings=description_embedding_store,
    embedding_vectorstore_key=EntityVectorStoreKey.ID,  # if the vectorstore uses entity title as ids, set this to EntityVectorStoreKey.TITLE
    text_embedder=text_embedder,
    token_encoder=token_encoder,
)

In [14]:
# text_unit_prop: proportion of context window dedicated to related text units
# community_prop: proportion of context window dedicated to community reports.
# The remaining proportion is dedicated to entities and relationships. Sum of text_unit_prop and community_prop should be <= 1
# conversation_history_max_turns: maximum number of turns to include in the conversation history.
# conversation_history_user_turns_only: if True, only include user queries in the conversation history.
# top_k_mapped_entities: number of related entities to retrieve from the entity description embedding store.
# top_k_relationships: control the number of out-of-network relationships to pull into the context window.
# include_entity_rank: if True, include the entity rank in the entity table in the context window. Default entity rank = node degree.
# include_relationship_weight: if True, include the relationship weight in the context window.
# include_community_rank: if True, include the community rank in the context window.
# return_candidate_context: if True, return a set of dataframes containing all candidate entity/relationship/covariate records that
# could be relevant. Note that not all of these records will be included in the context window. The "in_context" column in these
# dataframes indicates whether the record is included in the context window.
# max_tokens: maximum number of tokens to use for the context window.


local_context_params = {
    "text_unit_prop": 0.5,
    "community_prop": 0.1,
    "conversation_history_max_turns": 5,
    "conversation_history_user_turns_only": True,
    "top_k_mapped_entities": 10,
    "top_k_relationships": 10,
    "include_entity_rank": True,
    "include_relationship_weight": True,
    "include_community_rank": False,
    "return_candidate_context": False,
    "embedding_vectorstore_key": EntityVectorStoreKey.ID,  # set this to EntityVectorStoreKey.TITLE if the vectorstore uses entity title as ids
    "max_tokens": 12_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000)
}

model_params = {
    "max_tokens": 2_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 1000-1500)
    "temperature": 0.0,
}
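
To make the proportions concrete, the split described in the comments above can be worked out by hand. This sketch only illustrates the arithmetic (text units and community reports take their shares, and entities and relationships get the remainder). The library's exact rounding and accounting may differ, and `split_context_budget` is a hypothetical name.

```python
def split_context_budget(max_tokens: int, text_unit_prop: float, community_prop: float) -> dict[str, int]:
    """Divide a context-window budget following the proportions described above."""
    if text_unit_prop < 0 or community_prop < 0 or text_unit_prop + community_prop > 1:
        raise ValueError("text_unit_prop and community_prop must be non-negative and sum to at most 1")
    text_unit_tokens = int(max_tokens * text_unit_prop)
    community_tokens = int(max_tokens * community_prop)
    return {
        "text_units": text_unit_tokens,
        "community_reports": community_tokens,
        # whatever is left over goes to the entity and relationship tables
        "entities_and_relationships": max_tokens - text_unit_tokens - community_tokens,
    }


budget = split_context_budget(12_000, text_unit_prop=0.5, community_prop=0.1)
print(budget)
# {'text_units': 6000, 'community_reports': 1200, 'entities_and_relationships': 4800}
```

With the settings used in this notebook, roughly 40% of the 12,000-token context is left for entities and relationships.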

In [15]:
search_engine = LocalSearch(
    model=chat_model,
    context_builder=context_builder,
    token_encoder=token_encoder,
    model_params=model_params,
    context_builder_params=local_context_params,
    response_type="multiple paragraphs",  # free form text describing the response type and format, can be anything, e.g. prioritized list, single paragraph, multiple paragraphs, multiple-page report
)

In [17]:
result = await search_engine.search("what is the difference between pipeline and dataflow?")
print(result.response)



### Overview

Pipelines and dataflows are both integral components of data processing and management within Microsoft Fabric, but they serve distinct purposes and functionalities. Understanding the differences between these two can help in effectively designing and managing data workflows.

### Pipelines

A pipeline is a comprehensive data processing workflow that encompasses a variety of activities aimed at managing data movement and transformation. It includes essential tasks such as copy activities, which facilitate the transfer of data, and other operations that allow users to effectively control data flows. Pipelines are crucial for orchestrating data processing tasks within technology environments, and they can encounter failures, particularly when attempting to connect to external services like Kusto for token retrieval [Data: Entities (594); Relationships (2777, 2600, 3043, 2779)].

Pipelines are created and managed within the Data Factory, highlighting their integral role in t
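
The answer cites its sources inline in the form `[Data: Entities (594); Relationships (2777, 2600, 3043, 2779)]`. A small parser can collect those ids for inspection. This is a sketch based only on the marker format visible in the output above. `extract_citations` is a hypothetical helper, and the optional `+more` suffix it tolerates is an assumption about how long citation lists get abbreviated.

```python
import re
from collections import defaultdict

# One bracketed citation block, e.g. "[Data: Entities (594); Relationships (2777, 2600)]"
CITATION_BLOCK = re.compile(r"\[Data: ([^\]]+)\]")
# One "Kind (id, id, ...)" item inside a block; a trailing "+more" is tolerated
CITATION_ITEM = re.compile(r"(\w[\w ]*?) \(([\d, ]+)(?:,?\s*\+more)?\)")


def extract_citations(text: str) -> dict[str, set[int]]:
    """Collect the cited record ids per record kind from a response string."""
    cited: dict[str, set[int]] = defaultdict(set)
    for block in CITATION_BLOCK.findall(text):
        for kind, ids in CITATION_ITEM.findall(block):
            cited[kind].update(int(i) for i in ids.split(",") if i.strip())
    return dict(cited)


sample = "Pipelines can fail [Data: Entities (594); Relationships (2777, 2600, 3043, 2779)]."
print(extract_citations(sample))
```

Assuming the cited numbers correspond to the `human_readable_id` column, something like `entity_df[entity_df["human_readable_id"].isin(cited["Entities"])]` would list the cited entity rows, where `cited = extract_citations(result.response)`.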

In [27]:
import asyncio
from IPython.display import display, clear_output, Markdown

async def stream_answer():
    output = ""  # response text accumulated so far
    async for chunk in search_engine.stream_search("what is the difference between pipeline and dataflow?"):
        for char in chunk:
            output += char
            # redraw the whole answer as Markdown after each character
            clear_output(wait=True)
            display(Markdown(output))
            await asyncio.sleep(0.003)  # simulate typing speed

await stream_answer()


To understand the difference between a pipeline and a dataflow, it's important to consider their roles and functionalities within data processing environments, particularly in the context of Microsoft Fabric.

### Pipelines

A pipeline is a comprehensive data processing workflow that encompasses a variety of activities aimed at managing data movement and transformation. It includes essential tasks such as copy activities, which facilitate the transfer of data, and other operations that allow users to effectively control data flows. Pipelines are integral components within data processing systems, orchestrating various data workflows and transformations. They are created and managed within platforms like Data Factory, highlighting their role in facilitating data integration and transformation processes [Data: Entities (594); Relationships (2777, 2600)].

Pipelines are susceptible to errors, particularly when attempting to connect to external services like Kusto for token retrieval, which may disrupt their functionality. This highlights the need for careful monitoring and management to ensure smooth data flow and integrity [Data: Entities (594); Relationships (2779)].

### Dataflows

Dataflows, on the other hand, are specific activities within data processing systems that allow users to design and execute data transformation processes. They are integral components of data pipelines, serving to define the pathways through which data is processed and transitioned across various stages of transformation and storage. This relationship underscores the essential role that dataflows play in the overall functionality of data pipelines, indicating that they are fundamental to the workflows involved in data processing [Data: Entities (474, 1405); Relationships (2133)].

Dataflows are designed to centralize data preparation logic, enabling the sharing of commonly used data tables among creators of semantic models. This approach significantly enhances data consistency and minimizes the frequency of refreshes on source systems. Additionally, dataflows facilitate the transformation and loading of data into a data model, thereby streamlining data integration and reporting efforts [Data: Entities (1405)].

### Conclusion

In summary, while both pipelines and dataflows are crucial for data processing, pipelines serve as the overarching framework that orchestrates various data workflows and transformations, whereas dataflows focus specifically on the transformation processes within these workflows. Pipelines manage the broader data movement and integration tasks, while dataflows handle the detailed transformation and preparation of data for further analysis and reporting.
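
The streaming cell redraws the full answer once per character, which gets slower as the answer grows. One alternative is to redraw once per streamed chunk. The sketch below shows only the accumulation logic. `collect_stream` is a hypothetical helper, and `fake_stream` is a stand-in for `search_engine.stream_search(...)` so the example runs without a model.

```python
import asyncio
from typing import AsyncIterator, Callable


async def collect_stream(chunks: AsyncIterator[str], on_update: Callable[[str], None]) -> str:
    """Accumulate streamed chunks, calling on_update once per chunk with the text so far."""
    output = ""
    async for chunk in chunks:
        output += chunk
        on_update(output)
    return output


async def fake_stream() -> AsyncIterator[str]:
    """Stand-in for a streaming search call; yields a few fixed chunks."""
    for chunk in ["Pipelines ", "orchestrate; ", "dataflows transform."]:
        yield chunk
```

In the notebook, the callback could clear the cell output and display `Markdown(text)`, and the call would be `await collect_stream(search_engine.stream_search(query), render)` with `render` being that callback.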