source: <a href="https://medium.com/@zilliz_learn/graphrag-explained-enhancing-rag-with-knowledge-graphs-3312065f99e1">GraphRAG</a>

RAG: uses a vector database to retrieve semantically similar text. <br/>
GraphRAG: enhances RAG by incorporating knowledge graphs (KGs). <br/>
Knowledge graphs (KGs): data structures that store pieces of data as nodes and link them according to the relationships between them.<br/>
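
To make the knowledge graph idea concrete, here is a minimal sketch using networkx. The triples are invented for illustration and are not taken from the dataset used later; GraphRAG extracts its own entities and relationships with an LLM.

```python
import networkx as nx

# Hypothetical (subject, relation, object) triples, invented for illustration.
triples = [
    ("Career Fair", "organized_by", "Career Center"),
    ("Career Fair", "held_at", "Student Center"),
    ("Resume Workshop", "organized_by", "Career Center"),
]

# Each entity becomes a node; each relationship becomes a labeled, directed edge.
kg = nx.DiGraph()
for subject, relation, obj in triples:
    kg.add_edge(subject, obj, relation=relation)

# Which entities point at "Career Center", and through which relation?
organized = sorted(
    (src, data["relation"])
    for src, _, data in kg.in_edges("Career Center", data=True)
)
print(organized)
```

Answering "what is connected to this entity, and how" is a graph lookup here rather than a similarity search over text chunks.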

GraphRAG has two processes: indexing and querying. <br/>

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*CFrSdpijjpq7HD3h.png" width="500">

GraphRAG Query <br/>

Global Search: reasoning about holistic questions related to the whole data corpus by leveraging the community summaries.<br/>
Local Search: reasoning about specific entities by fanning out to their neighbors and associated concepts. <br/>
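
The difference between the two modes can be sketched on a toy graph. This is only an illustration under stated assumptions: the entities are invented, `local_context` and `global_context` are hypothetical helper names, and networkx's greedy modularity algorithm stands in for the community detection step. GraphRAG additionally has an LLM write a summary for each community, which is omitted here.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy entity graph: two tightly knit groups joined by a single bridge edge.
g = nx.Graph()
g.add_edges_from([
    ("Career Fair", "Career Center"),
    ("Career Fair", "Employers"),
    ("Career Center", "Employers"),
    ("Concert", "Music Dept"),
    ("Concert", "Auditorium"),
    ("Music Dept", "Auditorium"),
    ("Employers", "Auditorium"),  # bridge between the two groups
])

def local_context(graph, entity, radius=1):
    """Local search idea: fan out from one entity to its neighborhood."""
    return sorted(nx.ego_graph(graph, entity, radius=radius).nodes())

def global_context(graph):
    """Global search idea: partition the whole graph into communities.

    In GraphRAG each community would get an LLM-written summary;
    here we only return the members of each community.
    """
    return sorted(sorted(c) for c in greedy_modularity_communities(graph))

local = local_context(g, "Career Fair")
communities = global_context(g)
print(local)
print(communities)
```

A local query only ever sees the neighborhood of the entities it mentions, while a global query works from one summary per community, which is why it suits questions about the corpus as a whole.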

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*b4NUADKOIYWUB544.png" width="500">

Let's install GraphRAG

In [65]:
! pip install git+https://github.com/zc277584121/graphrag.git
! pip install future
! pip install plotly

Collecting git+https://github.com/zc277584121/graphrag.git
  Cloning https://github.com/zc277584121/graphrag.git to /private/var/folders/dg/1lmjvnjn703dyf0qt4nzvyph0000gn/T/pip-req-build-3xopr50j
  Running command git clone --filter=blob:none --quiet https://github.com/zc277584121/graphrag.git /private/var/folders/dg/1lmjvnjn703dyf0qt4nzvyph0000gn/T/pip-req-build-3xopr50j
  Resolved https://github.com/zc277584121/graphrag.git to commit c7263931eb9b3b6c7cd72fc11e4a6d4b11e6d48b
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done

[notice] A new release of pip is available: 23.2.1 -> 24.2
[notice] To update, run: pip install --upgrade pip

Dependencies

In [66]:
import os
import re
import html
import requests
import networkx as nx
import numpy as np
import plotly.graph_objects as go

Data Preparation

In [67]:
# utility functions
def strip_html(text):
  """ remove HTML tags from a string """
  if not isinstance(text, str):
    return ""
  clean = re.compile("<.*?>")
  return re.sub(clean, "", text)

def preprocess_events(events):
  """ construct dictionary from event data """
  return [
    {
      "title": event["title"],
      "group_title": event["group_title"],
      "url": event["url"],
      "description": strip_html(event["description"]),
      "date": event["date"],
      "date_time": event["date_time"],
      "location": event["location"],
      "location_title": event["location_title"],
      "location_latitude": float(event["location_latitude"]) if event["location_latitude"] is not None else 0,
      "location_longitude": float(event["location_longitude"]) if event["location_longitude"] is not None else 0,
      "cost": event["cost"],
      "thumbnail": event["thumbnail"],
      "event_types": event["event_types"],
      "event_types_audience": event["event_types_audience"],
    }
    for event in events
  ]

def transform_event_to_sentence(event):
  # extract fields from the event record
  title = event.get("title", None)
  group_title = event.get("group_title", None)
  date = event.get("date", None)
  date_time = event.get("date_time", None)
  location = event.get("location", None)
  description = event.get("description", "").strip()
  location_title = event.get("location_title", None)
  cost = event.get("cost", None)
  event_types = event.get("event_types", None)
  event_types_audience = event.get("event_types_audience", None)
  url = event.get("url", None)

  sentence = ""
  sentence += f"The event titled '{title}'" if title else "The event with no title"
  sentence += f" is organized by {group_title}" if group_title else ""
  # only join with "and" when an organizer clause precedes the date clause
  sentence += f"{' and' if group_title else ''} is scheduled to take place on {date}." if date else "."
  sentence += f" At {date_time}." if date_time else ""
  sentence += f" The event will be held at {location}." if location else ""
  sentence += f" ({location_title})." if location_title else ""
  sentence += f" The cost for attending is {cost}." if cost else " The cost for attending is FREE."
  sentence += f" Description: {description}." if description else ""
  sentence += f" This event is categorized under {event_types[0]}." if event_types else ""
  sentence += f" The intended audience for this event is {', '.join(event_types_audience)}." if event_types_audience else ""
  sentence += f" For more details, you can visit the event page at {url}." if url else ""

  return html.unescape(sentence)
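
As a quick check of what the cleanup above does to a description, here is a small self-contained sketch. The input string is invented for illustration; the pattern is the same non-greedy one used in `strip_html`, and the steps run in the same order as in the notebook (tags are stripped first, entities are unescaped last).

```python
import html
import re

TAG_RE = re.compile(r"<.*?>")  # same non-greedy pattern as in strip_html

# Hypothetical description string, invented for illustration.
raw = "<p>Meet recruiters &amp; alumni.<br/>Bring your <b>resume</b>!</p>"

no_tags = TAG_RE.sub("", raw)    # drop every <...> tag
clean = html.unescape(no_tags)   # turn &amp; back into &
print(clean)
```

Two things are worth noticing. Removing `<br/>` joins the neighboring sentences without a space, so descriptions may contain run-together sentences. And unescaping after stripping means an escaped literal such as `&lt;b&gt;` survives as visible text instead of being removed as a tag.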

In [68]:
# create graphrag data for indexing
index_root = os.path.join(os.getcwd(), 'graphrag_index')
os.makedirs(os.path.join(index_root, 'input'), exist_ok=True)

In [70]:
# read and write tamu events data
file_path = os.path.join(index_root, 'input', 'tamu_events.txt')
tamu_events_url = "https://calendar.tamu.edu/live/json/events/group"
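tamu_events = requests.get(tamu_events_url, timeout=30)
tamu_events.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
data = tamu_events.json()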

# transform it to sentences and store in .txt file
preprocessed_data = preprocess_events(data)
with open(file_path, 'w', encoding='utf-8') as f:
  for i, event in enumerate(preprocessed_data[:50]):
    sentence = transform_event_to_sentence(event)
    f.write(f"Event {i + 1}: {sentence} \n\n")

Use GraphRAG to index the text file. <br/> 
To initialize your workspace, let’s first run the graphrag.index --init command.

In [None]:
! python -m graphrag.index --init --root ./graphrag_index

To use Ollama (a local LLM) instead of ChatGPT, copy a settings.yaml configured for Ollama into the graphrag_index folder

In [79]:
! cp settings.yaml ./graphrag_index/

Run the indexing to create a graph

In [80]:
! python -m graphrag.index --root ./graphrag_index

🚀 Reading settings from graphrag_index/settings.yaml
⠹ GraphRAG Indexer
├── Loading Input (InputFileType.text) - 1 files loaded (0 filtered) ━ 100% … 0…

Run a query. Only the global search method is supported here.

In [75]:
! python -m graphrag.query --root ./graphrag_index --method global "When is the next career fairs?"



INFO: Reading settings from graphrag_index/settings.yaml
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/suphanut_jamonnak/.pyenv/versions/3.11.5/envs/tamu-chatbot/lib/python3.11/site-packages/graphrag/query/__main__.py", line 84, in <module>
    run_global_search(
  File "/Users/suphanut_jamonnak/.pyenv/versions/3.11.5/envs/tamu-chatbot/lib/python3.11/site-packages/graphrag/query/cli.py", line 67, in run_global_search
    final_nodes: pd.DataFrame = pd.read_parquet(
                                ^^^^^^^^^^^^^^^^
  File "/Users/suphanut_jamonnak/.pyenv/versions/3.11.5/envs/tamu-chatbot/lib/python3.11/site-packages/pandas/io/parquet.py", line 667, in read_parquet
    return impl.read(
           ^^^^^^^^^^
  File "/Users/suphanut_jamonnak/.pyenv/versions/3.11.5/envs/tamu-chatbot/lib/python3.11/site-packages/pandas/io/parquet.py", line 267, in read
    path_or_handle, handles, f

Let's visualize the graph created by GraphRAG
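
The next cell reads the graph from a hardcoded run folder (`output/20240708-161630`), which will differ on every indexing run. A minimal sketch for locating the newest run folder instead is shown here. `latest_run_dir` is a hypothetical helper, not part of GraphRAG, and it assumes that run folders are named with sortable timestamps as in the path below; the folder names in the demo are made up.

```python
import tempfile
from pathlib import Path

def latest_run_dir(index_root):
    """Return the newest run folder under <index_root>/output, or None.

    Assumes folder names are timestamps such as 20240708-161630, so that
    lexicographic order matches chronological order.
    """
    output_dir = Path(index_root) / "output"
    if not output_dir.is_dir():
        return None
    runs = sorted(p for p in output_dir.iterdir() if p.is_dir())
    return runs[-1] if runs else None

# Demo on a throwaway directory with two made-up run folders.
with tempfile.TemporaryDirectory() as root:
    for name in ("20240708-161630", "20240709-090000"):
        (Path(root) / "output" / name / "artifacts").mkdir(parents=True)
    newest_name = latest_run_dir(root).name

print(newest_name)
```

With this helper, the GraphML path could be built as `str(latest_run_dir("./graphrag_index") / "artifacts" / "summarized_graph.graphml")`, after checking that the helper did not return `None`.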

In [None]:
# load the GraphML file
graph = nx.read_graphml("./graphrag_index/output/20240708-161630/artifacts/summarized_graph.graphml")
# create a 3D spring layout with more separation
pos = nx.spring_layout(graph, dim=3, seed=42, k=0.5)
# extract node positions
x_nodes = [pos[node][0] for node in graph.nodes()]
y_nodes = [pos[node][1] for node in graph.nodes()]
z_nodes = [pos[node][2] for node in graph.nodes()]
# extract edge positions
x_edges = []
y_edges = []
z_edges = []

for edge in graph.edges():
    x_edges.extend([pos[edge[0]][0], pos[edge[1]][0], None])
    y_edges.extend([pos[edge[0]][1], pos[edge[1]][1], None])
    z_edges.extend([pos[edge[0]][2], pos[edge[1]][2], None])
# generate node colors based on a colormap
node_colors = np.array([graph.degree(node) for node in graph.nodes()], dtype=float)
degree_range = node_colors.max() - node_colors.min()
node_colors = (node_colors - node_colors.min()) / (degree_range if degree_range > 0 else 1.0)  # normalize to [0, 1]; guard against all-equal degrees
# create the trace for edges
edge_trace = go.Scatter3d(
    x=x_edges, y=y_edges, z=z_edges,
    mode='lines',
    line=dict(color='lightgray', width=0.5),
    hoverinfo='none'
)
# create the trace for nodes
node_trace = go.Scatter3d(
    x=x_nodes, y=y_nodes, z=z_nodes,
    mode='markers+text',
    marker=dict(
        size=7,
        color=node_colors,
        colorscale='Viridis',  # Use a color scale for the nodes
        colorbar=dict(
            title='Node Degree',
            thickness=10,
            x=1.1,
            tickvals=[0, 1],
            ticktext=['Low', 'High']
        ),
        line=dict(width=1)
    ),
    text=[node for node in graph.nodes()],
    textposition="top center",
    textfont=dict(size=10, color='black'),
    hoverinfo='text'
)
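
The edge lists built above append `None` after every pair of endpoints. Plotly treats a `None` coordinate as a gap in a line trace, so all edges can be drawn as separate segments within one `Scatter3d` trace. A minimal sketch of the resulting layout, using a three-node path graph and fixed positions chosen for illustration:

```python
import networkx as nx

g = nx.path_graph(3)  # nodes 0, 1, 2 with edges (0, 1) and (1, 2)

# Fixed, made-up 3D positions so the result is easy to read.
pos = {0: (0.0, 0.0, 0.0), 1: (1.0, 0.0, 0.0), 2: (1.0, 1.0, 0.0)}

x_edges = []
for u, v in g.edges():
    # start point, end point, then None to break the line before the next edge
    x_edges.extend([pos[u][0], pos[v][0], None])

print(x_edges)
```

Each edge contributes three entries per axis, so the list length is always three times the number of edges. Without the `None` separators, the trace would also connect the end of one edge to the start of the next.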

In [None]:
# create the 3D plot
fig = go.Figure(data=[edge_trace, node_trace])
# update layout for better visualization
fig.update_layout(
    title='3D Graph Visualization',
    showlegend=False,
    scene=dict(
        xaxis=dict(showbackground=False),
        yaxis=dict(showbackground=False),
        zaxis=dict(showbackground=False)
    ),
    margin=dict(l=0, r=0, b=0, t=40),
    annotations=[
        dict(
            showarrow=False,
            text="Interactive 3D visualization of GraphML data",
            xref="paper",
            yref="paper",
            x=0,
            y=0
        )
    ]
)
# show the plot
fig.show()