In [1]:
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License.

In [2]:
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logging.getLogger("httpx").setLevel(logging.WARNING)

In [3]:
import os

import pandas as pd
import tiktoken

from graphrag.query.indexer_adapters import (
    read_indexer_communities,
    read_indexer_entities,
    read_indexer_reports,
)
from graphrag.query.llm.oai.chat_openai import ChatOpenAI
from graphrag.query.llm.oai.typing import OpenaiApiType
from graphrag.query.structured_search.global_search.community_context import (
    GlobalCommunityContext,
)
from graphrag.query.structured_search.global_search.search import GlobalSearch


## Global Search example

The global search method generates answers by searching over all AI-generated community reports in a map-reduce fashion. This is resource-intensive, but it often gives good responses to questions that require an understanding of the dataset as a whole (e.g. What are the most significant values of the herbs mentioned in this notebook?).

We first demonstrate how to run global search with a fixed community level, followed by dynamic community selection.
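
To make the map-reduce idea concrete, here is a minimal, self-contained sketch. It is not the graphrag implementation: the two LLM calls are replaced by hypothetical stand-ins (`map_step` scores reports by keyword overlap, `reduce_step` keeps the top-scoring points), and batching is by report count rather than by token budget.

```python
from dataclasses import dataclass


@dataclass
class Point:
    description: str
    score: int


def batch_reports(reports, batch_size):
    """Split the reports into fixed-size batches (stand-in for token-based batching)."""
    return [reports[i : i + batch_size] for i in range(0, len(reports), batch_size)]


def map_step(batch, question):
    """Hypothetical stand-in for the map LLM call: score reports by keyword overlap."""
    words = set(question.lower().split())
    points = []
    for report in batch:
        overlap = len(words & set(report.lower().split()))
        if overlap:
            points.append(Point(description=report, score=overlap))
    return points


def reduce_step(all_points, top_k):
    """Hypothetical stand-in for the reduce LLM call: keep the highest-scoring points."""
    ranked = sorted(all_points, key=lambda p: p.score, reverse=True)
    return [p.description for p in ranked[:top_k]]


def toy_global_search(reports, question, batch_size=2, top_k=2):
    """Map over every batch independently, then reduce the collected points."""
    points = []
    for batch in batch_reports(reports, batch_size):
        points.extend(map_step(batch, question))
    return reduce_step(points, top_k)
```

The shape to notice is that every batch is processed independently in the map step, so the cost grows with the number of reports, which is what dynamic community selection later tries to reduce.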

### LLM setup

In [4]:
api_key = os.environ["GRAPHRAG_LLM_API_KEY"]
api_base = os.environ["GRAPHRAG_LLM_API_BASE"]
api_version = os.environ["GRAPHRAG_LLM_API_VERSION"]
model = os.environ["GRAPHRAG_LLM_MODEL"]

llm = ChatOpenAI(
    api_key=api_key,
    model=model,
    deployment_name=model,
    api_base=api_base,
    api_version=api_version,
    api_type=OpenaiApiType.AzureOpenAI,  # OpenaiApiType.OpenAI or OpenaiApiType.AzureOpenAI
    max_retries=20,
)
token_encoder = tiktoken.encoding_for_model(llm.model)

### Load community reports as context for global search

- Load all community reports in the `create_final_community_reports` table from the ire-indexing engine, to be used as context data for global search.
- Load entities from the `create_final_nodes` and `create_final_entities` tables from the ire-indexing engine, to be used for calculating community weights for context ranking. Note that this is optional: if no entities are provided, we do not calculate community weights and only use the `rank` attribute in the community reports table for context ranking.
- Load all communities in the `create_final_communities` table from the ire-indexing engine, to be used to reconstruct the community graph hierarchy for dynamic community selection.
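
As a rough illustration of how entities can feed into community weights, the sketch below assumes that a community's weight is the number of distinct text units in which its entities occur, optionally normalized by the largest weight. The record shape (`community`, `text_unit_ids`) and the function name are hypothetical and are not taken from the graphrag API.

```python
from collections import defaultdict


def community_weights(entities, normalize=True):
    """Count distinct text units per community (assumed definition of the weight).

    entities: iterable of dicts with 'community' and 'text_unit_ids' keys
    (a hypothetical shape chosen for this sketch).
    """
    text_units = defaultdict(set)
    for entity in entities:
        text_units[entity["community"]].update(entity["text_unit_ids"])

    weights = {c: float(len(units)) for c, units in text_units.items()}
    if normalize and weights:
        top = max(weights.values())
        if top > 0:
            # Scale so that the heaviest community has weight 1.0
            weights = {c: w / top for c, w in weights.items()}
    return weights
```

Under this assumption, a community whose entities appear in many text units ranks ahead of one with the same `rank` but fewer supporting text units.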

In [5]:
# parquet files generated from indexing pipeline
INPUT_DIR = "./inputs/operation dulce"
COMMUNITY_TABLE = "create_final_communities"
COMMUNITY_REPORT_TABLE = "create_final_community_reports"
ENTITY_TABLE = "create_final_nodes"
ENTITY_EMBEDDING_TABLE = "create_final_entities"

community_df = pd.read_parquet(f"{INPUT_DIR}/{COMMUNITY_TABLE}.parquet")
entity_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_TABLE}.parquet")
report_df = pd.read_parquet(f"{INPUT_DIR}/{COMMUNITY_REPORT_TABLE}.parquet")
entity_embedding_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_EMBEDDING_TABLE}.parquet")

report_df.head()

| | community | full_content | level | rank | title | rank_explanation | summary | findings | full_content_json | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | # Paranormal Military Squad at Dulce Base: Dec... | 1 | 8.5 | Paranormal Military Squad at Dulce Base: Decod... | The impact severity rating is high due to the ... | The Paranormal Military Squad, stationed at Du... | [{'explanation': 'Jordan is a central figure i... | {\n "title": "Paranormal Military Squad at ... | 1ba2d200-dd26-4693-affe-a5539d0a0e0d |
| 1 | 11 | # Dulce and Paranormal Military Squad Operatio... | 1 | 8.5 | Dulce and Paranormal Military Squad Operations | The impact severity rating is high due to the ... | The community centers around Dulce, a secretiv... | [{'explanation': 'Dulce is described as a top-... | {\n "title": "Dulce and Paranormal Military... | a8a530b0-ae6b-44ea-b11c-9f70d138298d |
| 2 | 12 | # Paranormal Military Squad and Dulce Base Ope... | 1 | 7.5 | Paranormal Military Squad and Dulce Base Opera... | The impact severity rating is relatively high ... | The community centers around the Paranormal Mi... | [{'explanation': 'Taylor is a central figure w... | {\n "title": "Paranormal Military Squad and... | 0478975b-c805-4cc1-b746-82f3e689e2f3 |
| 3 | 13 | # Mission Dynamics and Leadership: Cruz and Wa... | 1 | 7.5 | Mission Dynamics and Leadership: Cruz and Wash... | The impact severity rating is relatively high ... | This report explores the intricate dynamics of... | [{'explanation': 'Cruz is a central figure in ... | {\n "title": "Mission Dynamics and Leadersh... | b56f6e68-3951-4f07-8760-63700944a375 |
| 4 | 14 | # Dulce Base and Paranormal Military Squad: Br... | 1 | 8.5 | Dulce Base and Paranormal Military Squad: Brid... | The impact severity rating is high due to the ... | The community centers around the Dulce Base, a... | [{'explanation': 'Sam Rivera, a member of the ... | {\n "title": "Dulce Base and Paranormal Mil... | 736e7006-d050-4abb-a122-00febf3f540f |


### GlobalSearch class setting

The same settings are used in both fixed and dynamic community selection.

In [6]:
context_builder_params = {
    "use_community_summary": False,  # False means using full community reports. True means using community short summaries.
    "shuffle_data": True,
    "include_community_rank": True,
    "min_community_rank": 0,
    "community_rank_name": "rank",
    "include_community_weight": True,
    "community_weight_name": "occurrence weight",
    "normalize_community_weight": True,
    "max_tokens": 12_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000)
    "context_name": "Reports",
}

map_llm_params = {
    "max_tokens": 1000,
    "temperature": 0.0,
    "response_format": {"type": "json_object"},
}

reduce_llm_params = {
    "max_tokens": 2000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 1000-1500)
    "temperature": 0.0,
}

global_search_params = {
    "llm": llm,
    "token_encoder": token_encoder,
    "max_data_tokens": 12_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000)
    "map_llm_params": map_llm_params,
    "reduce_llm_params": reduce_llm_params,
    "allow_general_knowledge": False,  # setting this to True adds an instruction that encourages the LLM to incorporate general knowledge in the response, which may increase hallucinations but could be useful in some use cases.
    "json_mode": True,  # set this to False if your LLM model does not support JSON mode.
    "context_builder_params": context_builder_params,
    "concurrent_coroutines": 32,
    "response_type": "multiple paragraphs",  # free form text describing the response type and format, can be anything, e.g. prioritized list, single paragraph, multiple paragraphs, multiple-page report
}

### Global search with fixed community selection 

The community level is the level in the Leiden community hierarchy from which we load the community reports.

A higher value means we use reports from more fine-grained communities, at a higher computation cost.

In [7]:
# community level in the Leiden community hierarchy from which we will load the community reports
# a higher value means we use reports from more fine-grained communities (at a higher computation cost)
COMMUNITY_LEVEL, DYNAMIC_SELECTION = 2, False

communities = read_indexer_communities(community_df, entity_df, report_df)
reports = read_indexer_reports(report_df, entity_df, COMMUNITY_LEVEL, DYNAMIC_SELECTION)
entities = read_indexer_entities(entity_df, entity_embedding_df, COMMUNITY_LEVEL)

print(f"Total report count: {len(report_df)}")
print(
    f"Report count after filtering by community level {COMMUNITY_LEVEL}: {len(reports)}"
)

filter under community level 2
roll up community level


Total report count: 20
Report count after filtering by community level 2: 17
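
The level filter itself can be pictured with a small sketch. It assumes, based on the "filter under community level" log line, that reports whose `level` is at most `COMMUNITY_LEVEL` are kept and that `None` disables the filter. The function name and record shape are hypothetical, and the roll-up step mentioned in the log is not reproduced here, so the counts from this sketch need not match those printed by `read_indexer_reports`.

```python
def filter_reports_by_level(reports, max_level):
    """Keep reports whose level is at most max_level; None keeps every report.

    reports: list of dicts with a 'level' key (hypothetical shape for this sketch).
    """
    if max_level is None:
        return list(reports)
    return [report for report in reports if report["level"] <= max_level]


toy_reports = [
    {"community": "0", "level": 0},
    {"community": "7", "level": 1},
    {"community": "19", "level": 2},
]
print(len(filter_reports_by_level(toy_reports, 1)))  # prints 2
```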


#### Build global context based on community reports

In [8]:
fixed_context_builder = GlobalCommunityContext(
    community_reports=reports,
    communities=communities,
    llm=llm,
    token_encoder=token_encoder,
    entities=entities,  # default to None if you don't want to use community weights for ranking
    dynamic_selection=DYNAMIC_SELECTION,
)

#### Perform global search with fixed communities

In [9]:
fixed_search_engine = GlobalSearch(
    context_builder=fixed_context_builder,
    **global_search_params,
)

In [10]:
QUESTION = "What is Cosmic Vocalization and who are involved in it?"

result = await fixed_search_engine.asearch(QUESTION)

print(result.response)

Computing community weights...
Map response: {
    "points": [
        {
            "description": "Cosmic Vocalization is a central focus of the community, drawing attention from various individuals and groups. It is perceived as part of an interstellar duet by Alex Mercer, suggesting a responsive approach to cosmic events [Data: Reports (6)].",
            "score": 85
        },
        {
            "description": "Taylor Cruz raises concerns about Cosmic Vocalization, fearing it might be a homing tune, which adds a layer of urgency and potential threat to the phenomenon [Data: Reports (6)].",
            "score": 80
        },
        {
            "description": "The Paranormal Military Squad is engaged in strategic responses to Cosmic Vocalization as part of their mission, highlighting its significance in security measures [Data: Reports (6)].",
            "score": 75
        },
        {
            "description": "The Universe is metaphorically treated as a concert hall by th

### Overview of Cosmic Vocalization

Cosmic Vocalization is a phenomenon that has garnered significant attention within the community, involving various individuals and groups who interpret and respond to it in different ways. It is perceived as a central focus, with implications that span from artistic interpretations to security concerns.

### Key Perspectives and Interpretations

#### Alex Mercer's View
Alex Mercer perceives Cosmic Vocalization as part of an interstellar duet, suggesting a responsive and perhaps harmonious interaction with cosmic events. This perspective highlights a more artistic and interpretive approach to understanding the phenomenon [Data: Reports (6)].

#### Taylor Cruz's Concerns
In contrast, Taylor Cruz raises concerns that Cosmic Vocalization might be a homing tune, which introduces a sense of urgency and potential threat. This viewpoint underscores the possibility that the phenomenon could have more ominous implications, necessitating caution and further i

In [11]:
# inspect the data used to build the context for the LLM responses
result.context_data["reports"]

| | id | title | occurrence weight | content | rank |
|---|---|---|---|---|---|
| 0 | 15 | Dulce Base and the Paranormal Military Squad: ... | 1.0 | # Dulce Base and the Paranormal Military Squad... | 9.5 |
| 1 | 11 | Dulce and Paranormal Military Squad Operations | 0.3 | # Dulce and Paranormal Military Squad Operatio... | 8.5 |
| 2 | 10 | Paranormal Military Squad at Dulce Base: Decod... | 0.3 | # Paranormal Military Squad at Dulce Base: Dec... | 8.5 |
| 3 | 7 | Operation: Dulce and the Paranormal Military S... | 0.2 | # Operation: Dulce and the Paranormal Military... | 8.5 |
| 4 | 8 | Dr. Jordan Hayes and the Paranormal Military S... | 0.18 | # Dr. Jordan Hayes and the Paranormal Military... | 8.5 |
| 5 | 1 | Earth's Interstellar Communication Initiative | 0.16 | # Earth's Interstellar Communication Initiativ... | 8.5 |
| 6 | 12 | Paranormal Military Squad and Dulce Base Opera... | 0.16 | # Paranormal Military Squad and Dulce Base Ope... | 7.5 |
| 7 | 13 | Mission Dynamics and Leadership: Cruz and Wash... | 0.16 | # Mission Dynamics and Leadership: Cruz and Wa... | 7.5 |
| 8 | 14 | Dulce Base and Paranormal Military Squad: Brid... | 0.12 | # Dulce Base and Paranormal Military Squad: Br... | 8.5 |
| 9 | 16 | Dulce Military Base and Alien Intelligence Com... | 0.08 | # Dulce Military Base and Alien Intelligence C... | 8.5 |


In [12]:
# inspect number of LLM calls and tokens
print(
    f"LLM calls: {result.llm_calls}, prompt tokens: {result.prompt_tokens}, output tokens: {result.output_tokens}."
)

LLM calls: 2, prompt tokens: 11350, output tokens: 797.


### Global search with dynamic community selection 

The goal of dynamic community selection is to reduce the number of community reports that need to be processed in the [map-reduce](https://microsoft.github.io/graphrag/posts/query/0-global_search/) operation. To that end, we take advantage of the hierarchical structure of the indexed dataset. We first ask the LLM to rate how relevant each level 0 community is to the user query, then traverse down to its child node(s) if the current community report is deemed relevant.

You can still set a `COMMUNITY_LEVEL` to filter out lower-level community reports and apply dynamic community selection to the filtered reports.

Note that the dataset is quite small, consisting of only 20 communities across 2 levels (levels 0 and 1). Dynamic community selection is more effective when there is a large amount of content to be filtered out.
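
The traversal described above can be sketched as a level-by-level descent. This is an illustration rather than the graphrag implementation: `rate` is a stand-in for the LLM rating call, the `children` mapping is a hypothetical representation of the hierarchy, and the relevance threshold of 2 is an assumption.

```python
def select_communities(children, roots, rate, threshold=2):
    """Rate communities level by level, descending only below relevant ones.

    children: dict mapping a community id to a list of child community ids
    roots: level 0 community ids
    rate: callable returning a relevance rating for a community id (LLM stand-in)
    threshold: minimum rating for a community to count as relevant (assumed value)
    """
    relevant = []
    ratings = {}
    current_level = list(roots)
    while current_level:
        next_level = []
        for community in current_level:
            ratings[community] = rate(community)
            if ratings[community] >= threshold:
                relevant.append(community)
                # Only children of relevant communities are rated at the next level
                next_level.extend(children.get(community, []))
        current_level = next_level
    return relevant, ratings
```

Subtrees under an irrelevant community are never rated, which is where the savings come from. Whether a relevant parent is kept alongside its relevant children is a design choice; this sketch keeps both.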

In [13]:
# set COMMUNITY_LEVEL to None to use all reports
COMMUNITY_LEVEL, DYNAMIC_SELECTION = None, True

communities = read_indexer_communities(community_df, entity_df, report_df)
reports = read_indexer_reports(report_df, entity_df, COMMUNITY_LEVEL, DYNAMIC_SELECTION)
entities = read_indexer_entities(entity_df, entity_embedding_df, COMMUNITY_LEVEL)

print(f"Total report count: {len(report_df)}")
print(
    f"Report count after filtering by community level {COMMUNITY_LEVEL}: {len(reports)}"
)

Total report count: 20
Report count after filtering by community level None: 20


#### Global search with dynamic communities

In [14]:
dynamic_context_builder = GlobalCommunityContext(
    community_reports=reports,
    communities=communities,
    llm=llm,
    token_encoder=token_encoder,
    entities=entities,
    dynamic_selection=DYNAMIC_SELECTION,  # set dynamic selection to True to enable dynamic community selection
)

dynamic_search_engine = GlobalSearch(
    context_builder=dynamic_context_builder,
    **global_search_params,
)

In [15]:
result = await dynamic_search_engine.asearch(QUESTION)
print(result.response)

Level 0: 100%|██████████| 7/7 [00:01<00:00,  5.43it/s]
dynamic community selection: community 3 rating 1
dynamic community selection: community 0 rating 1
dynamic community selection: community 2 rating 1
dynamic community selection: community 1 rating 2
dynamic community selection: community 4 rating 1
dynamic community selection: community 5 rating 1
dynamic community selection: community 6 rating 4
dynamic community selection (took: 1s)
	rating distribution {1: 5, 2: 1, 4: 1}
	2 out of 20 community reports are relevant
	prompt tokens: 4810, output tokens: 7
Computing community weights...
Map response: {
    "points": [
        {
            "description": "Cosmic Vocalization is a focal point of interest within the community, drawing attention from various individuals and groups. It is perceived as part of an interstellar duet by Alex Mercer, suggesting a responsive approach to cosmic events [Data: Reports (6)].",
            "score": 90
        },
        {
            "description

### Overview of Cosmic Vocalization

Cosmic Vocalization is a significant phenomenon within the community, drawing attention from various individuals and groups. It is perceived as part of an interstellar duet by Alex Mercer, suggesting a responsive approach to cosmic events [Data: Reports (6)]. This indicates that Cosmic Vocalization may be an interactive or communicative event, potentially involving responses to other cosmic occurrences.

### Key Concerns and Interpretations

Taylor Cruz raises concerns about Cosmic Vocalization, fearing it might be a homing tune, which adds a layer of urgency and potential threat [Data: Reports (6)]. This perspective introduces the possibility that Cosmic Vocalization could have implications for security or navigation, possibly attracting unwanted attention or entities.

### Strategic and Security Implications

The Paranormal Military Squad is actively engaged with Cosmic Vocalization, indicating a strategic response to these cosmic phenomena as par

Dynamic community selection filters out community reports that are not relevant to the `QUESTION`. In this example, only 2 reports are used in the map-reduce operation, as opposed to 16 in the fixed community selection setting, hence reducing the overall prompt tokens needed.

In [16]:
# inspect the data used to build the context for the LLM responses
result.context_data["reports"]

| | id | title | occurrence weight | content | rank |
|---|---|---|---|---|---|
| 0 | 1 | Earth's Interstellar Communication Initiative | 1.0 | # Earth's Interstellar Communication Initiativ... | 8.5 |
| 1 | 6 | Cosmic Vocalization and Universe Interactions | 0.125 | # Cosmic Vocalization and Universe Interaction... | 7.5 |


In [17]:
print(
    f"LLM calls: {result.llm_calls}, prompt tokens: {result.prompt_tokens}, output tokens: {result.output_tokens}."
)

LLM calls: 9, prompt tokens: 7748, output tokens: 758.
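
The two runs above can be compared directly. The snippet below copies the numbers from the printed outputs and computes the relative prompt-token saving; the helper name is ours and only serves the illustration.

```python
def relative_saving(baseline, candidate):
    """Fraction of the baseline saved by the candidate (positive means fewer tokens)."""
    return (baseline - candidate) / baseline


# Numbers copied from the printed outputs of the two searches above
fixed = {"llm_calls": 2, "prompt_tokens": 11350, "output_tokens": 797}
dynamic = {"llm_calls": 9, "prompt_tokens": 7748, "output_tokens": 758}

saving = relative_saving(fixed["prompt_tokens"], dynamic["prompt_tokens"])
extra_calls = dynamic["llm_calls"] - fixed["llm_calls"]
print(f"Prompt token saving: {saving:.1%}, additional LLM calls: {extra_calls}")
# prints: Prompt token saving: 31.7%, additional LLM calls: 7
```

The additional calls are consistent with the seven level 0 ratings shown in the log, so on this small dataset dynamic selection trades more, smaller LLM calls for fewer prompt tokens overall.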
