# Building Knowledge Graphs at Production Scale
Using knowledge graphs to improve the results of Retrieval-Augmented Generation (RAG) applications is widely discussed. Most examples demonstrate how to build a knowledge graph from a relatively small number of documents. This may be because the typical approach, extracting fine-grained, entity-centric information, just doesn’t scale. Running each document through a model to extract the entities (nodes) and relationships (edges) takes too long (and costs too much) on large datasets.

We’ve talked about the idea of content-centric knowledge graphs (a vector store that allows links between chunks) as an easier-to-use and more efficient approach. In this post we put that to the test. We load a subset of the Wikipedia articles from the [2wikimultihop](https://github.com/Alab-NII/2wikimultihop) dataset using both techniques and discuss what this means for loading the entire dataset. We then compare the answers each approach gives to some questions over the loaded data. We’ll also load the entire dataset, nearly 6 million documents, into a content-centric [GraphVectorStore](https://www.datastax.com/blog/now-in-langchain-graph-vector-store-add-structured-data-to-rag-apps).

In [None]:
#@ Install modules
%pip install -U -r requirements.txt

In [2]:
#@ Configure import paths.
import sys
sys.path.append("../../")

# Initialize environment variables.
from utils import initialize_environment
initialize_environment()

## Data to Load
For this notebook, we'll load the first 100 articles from Wikipedia. We use Wikipedia data from the [2wikimultihop](https://github.com/Alab-NII/2wikimultihop) dataset. To execute the rest of the notebook, you will need [para_with_hyperlink.zip](https://www.dropbox.com/s/wlhw26kik59wbh8/para_with_hyperlink.zip) in the `wikimultihop` directory; the next cell downloads it if it is not already there.

In [2]:
from utils import download_file
download_file("https://www.dropbox.com/s/wlhw26kik59wbh8/para_with_hyperlink.zip?dl=1", "../../datasets/wikimultihop/para_with_hyperlink.zip")

File '../../datasets/wikimultihop/para_with_hyperlink.zip' already exists, skipping download.


In [3]:
from itertools import islice
from datasets.wikimultihop.load import wikipedia_lines

NUM_LINES_TO_LOAD = 100
lines_to_load = list(islice(wikipedia_lines(), NUM_LINES_TO_LOAD))

## Entity-Centric: LLMGraphTransformer

Loading documents into an entity-centric graph store like Neo4j is done using LangChain’s `LLMGraphTransformer`. The code is based on LangChain's ["How to construct knowledge graphs"](https://python.langchain.com/docs/how_to/graph_constructing/#llm-graph-transformer).

In [4]:
#@ Extract GraphDocuments
from langchain_core.documents import Document
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI
from langchain_community.callbacks import get_openai_callback

llm = ChatOpenAI(temperature=0, model_name="gpt-4o")
llm_transformer = LLMGraphTransformer(llm=llm)

from time import perf_counter
start = perf_counter()

documents_to_load = [Document(page_content=line) for line in lines_to_load]

with get_openai_callback() as cb:
    graph_documents = llm_transformer.convert_to_graph_documents(documents_to_load)
    end = perf_counter()

    print(f"Loaded (but NOT written) {NUM_LINES_TO_LOAD} in {end - start:0.2f}s")
    print(f"OpenAI stats: prompt tokens {cb.prompt_tokens}, completion tokens {cb.completion_tokens}, total cost {cb.total_cost}")

Loaded (but NOT written) 100 in 477.67s
OpenAI stats: prompt tokens 116180, completion tokens 26145, total cost 0.0


Start a Neo4j Docker instance:

### Unix
```bash
docker run -d -p=7474:7474 -p=7687:7687 -e NEO4J_AUTH=neo4j/password -e NEO4J_PLUGINS=\[\"apoc\"\] neo4j
```

### PowerShell
```powershell
docker run -d -p=7474:7474 -p=7687:7687 -e NEO4J_AUTH=neo4j/password -e NEO4J_PLUGINS='"[\"apoc\"]"' neo4j
```

In [21]:
#@ Write GraphDocuments to Neo4j
from langchain_community.graphs import Neo4jGraph

from time import perf_counter
start = perf_counter()

entity_centric_store = Neo4jGraph(url="bolt://localhost:7687", username="neo4j", password="password")
entity_centric_store.add_graph_documents(graph_documents)

end = perf_counter()
print(f"Written in {end - start:0.2f}s")

Written in 12.77s


## Content-Centric: GraphVectorStore
Loading the data into `GraphVectorStore` is roughly the same as loading it into a vector store. The only addition is that we compute metadata indicating how the pages link to each other.

In [5]:
#@ Configure Tables
import cassio
cassio.init(auto=True)
TABLE_NAME = "wiki_load"

In [1]:
#@ Empty the table (optional)
if input("clear data(y/N): ").lower() == "y":
    print("Clearing data...")
    from cassio.config import check_resolve_session, check_resolve_keyspace
    session = check_resolve_session()
    keyspace = check_resolve_keyspace()

    session.execute(f"TRUNCATE TABLE {keyspace}.{TABLE_NAME};")
    print("Done")
else:
    print("Skipped clearing data")

In [9]:
#@ Create GraphVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_community.graph_vectorstores.cassandra import CassandraGraphVectorStore

content_centric_store = CassandraGraphVectorStore(
    embedding = OpenAIEmbeddings(),
    node_table=TABLE_NAME,
    #insert_timeout = 1000.0,
)

In [10]:
#@ Add links to documents
import json
from langchain_core.graph_vectorstores.links import METADATA_LINKS_KEY, Link

def parse_document(line: str) -> Document:
    para = json.loads(line)

    doc_id = para["id"]
    # One outgoing link for every page this paragraph references.
    links = {
        Link.outgoing(kind="href", tag=ref_id)
        for mention in para["mentions"]
        if mention["ref_ids"] is not None
        for ref_id in mention["ref_ids"]
    }
    # One incoming link so that other pages can reach this one by its id.
    links.add(Link.incoming(kind="href", tag=doc_id))
    return Document(
        id = doc_id,
        page_content = " ".join(para["sentences"]),
        metadata = {
            "content_id": doc_id,
            METADATA_LINKS_KEY: list(links)
        },
    )
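
To make the link metadata more concrete, here is a minimal, library-free sketch of how outgoing and incoming tags pair up into edges. It assumes that an outgoing link of a given kind and tag points at every document that declares an incoming link with the same kind and tag. The three sample paragraphs and their IDs are hypothetical, and plain tuples stand in for `Link` objects, so this illustrates the idea rather than the store's actual implementation.

```python
from collections import defaultdict

# Hypothetical paragraphs in the same shape as the dataset lines parsed above.
paras = [
    {"id": "A", "mentions": [{"ref_ids": ["B"]}, {"ref_ids": None}]},
    {"id": "B", "mentions": [{"ref_ids": ["A", "C"]}]},
    {"id": "C", "mentions": []},
]

def link_tuples(para):
    """Mirror parse_document, using (direction, kind, tag) tuples instead of Link."""
    links = {
        ("out", "href", ref_id)
        for mention in para["mentions"]
        if mention["ref_ids"] is not None
        for ref_id in mention["ref_ids"]
    }
    links.add(("in", "href", para["id"]))
    return links

def resolve_edges(paras):
    """Pair each outgoing tag with the documents that declare the same incoming tag."""
    incoming = defaultdict(set)
    for para in paras:
        for direction, kind, tag in link_tuples(para):
            if direction == "in":
                incoming[(kind, tag)].add(para["id"])

    edges = set()
    for para in paras:
        for direction, kind, tag in link_tuples(para):
            if direction == "out":
                for target in incoming[(kind, tag)]:
                    edges.add((para["id"], target))
    return edges

edges = resolve_edges(paras)
print(sorted(edges))
```

Because each document only declares its own tags, no document needs to know about any other at load time; the edges fall out of matching tags, which is why loading stays as cheap as a plain vector-store insert.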


In [11]:
#@ Load Data Into GraphVectorStore
print("Loading content-centric data...")
from time import perf_counter

start = perf_counter()
kg_documents = [parse_document(line) for line in lines_to_load]
content_centric_store.add_documents(kg_documents)
end = perf_counter()
print(f"Loaded (and written) {NUM_LINES_TO_LOAD} in {end - start:0.2f}s")

Loading content-centric data...
Loaded (and written) 100 in 1.89s
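
A single `add_documents` call is enough for 100 documents. For much larger loads, one option is to split the documents into batches and write several batches concurrently. The sketch below is an assumption about how that could look, not the code used for the full-dataset load: `batched`, `load_in_parallel`, the batch size, the worker count, and the in-memory `FakeStore` stand-in are all hypothetical, and it presumes the real store's `add_documents` can safely be called from multiple threads.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from threading import Lock

def batched(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

def load_in_parallel(store, documents, batch_size=50, max_workers=8):
    """Write batches concurrently; returns the number of documents written."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list(...) waits for every batch and re-raises any write error.
        results = list(pool.map(store.add_documents, batched(documents, batch_size)))
    return sum(len(ids) for ids in results)

class FakeStore:
    """Hypothetical stand-in for a vector store, so the sketch runs offline."""

    def __init__(self):
        self._lock = Lock()
        self.docs = []

    def add_documents(self, docs):
        with self._lock:
            self.docs.extend(docs)
        # Like a LangChain vector store, return one ID per document written.
        return [str(i) for i, _ in enumerate(docs)]

store = FakeStore()
written = load_in_parallel(store, [f"doc-{i}" for i in range(1000)])
print(written)
```

Note that `ThreadPoolExecutor.map` submits every batch up front, so for millions of documents it would be worth feeding the pool in chunks rather than materializing all batches at once.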


## Loading Benchmarks
At 100 rows, the entity-centric approach using gpt-4o took 405.93s to extract the GraphDocuments and 10.99s to write them to Neo4j, while the content-centric approach took 1.43s. (These figures are from a different run than the cell outputs shown above.) Extrapolating linearly, it would take about 41 weeks to load all 5,989,847 pages using the entity-centric approach, and about 24 hours using the content-centric approach. However, thanks to parallelism, the content-centric approach runs in only 2.5 hours! Assuming the same parallelism benefits, it would still take over 4 weeks to load everything using the entity-centric approach. I didn’t try it, since the estimated cost would be $58,700, assuming everything worked the first time!
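
The extrapolation can be reproduced with a few lines of arithmetic. This sketch uses only the timings quoted in the paragraph above and assumes that per-page time stays constant at scale (a simplification) and that the entity-centric approach would see the same speedup from parallelism as the content-centric one.

```python
TOTAL_PAGES = 5_989_847
SAMPLE_SIZE = 100

# Timings quoted above, in seconds per 100 pages.
ENTITY_SECONDS = 405.93 + 10.99   # extraction + write to Neo4j
CONTENT_SECONDS = 1.43            # parse + write to GraphVectorStore

def extrapolate(sample_seconds, sample_size=SAMPLE_SIZE, total=TOTAL_PAGES):
    """Linear extrapolation: assumes per-page time stays constant at scale."""
    return sample_seconds / sample_size * total

entity_weeks = extrapolate(ENTITY_SECONDS) / (7 * 24 * 3600)
content_hours = extrapolate(CONTENT_SECONDS) / 3600

# The full content-centric load took about 2.5 hours, which implies this speedup.
speedup = content_hours / 2.5
entity_parallel_weeks = entity_weeks / speedup

print(f"entity-centric, sequential:  {entity_weeks:.1f} weeks")
print(f"content-centric, sequential: {content_hours:.1f} hours")
print(f"implied parallel speedup:    {speedup:.1f}x")
print(f"entity-centric, parallel:    {entity_parallel_weeks:.1f} weeks")
```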

**Bottom line: the entity-centric approach of extracting knowledge graphs from content using an LLM was both time- and cost-prohibitive at scale. On the other hand, using GraphVectorStore was fast and cheap.**

## Example Answers

In [24]:
#@ GraphVectorStore RAG chain
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

content_centric_retriever = content_centric_store.as_retriever()

prompt = hub.pull("rlm/rag-prompt")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


content_centric_chain = (
    {"context": content_centric_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)




In [25]:
from langchain.chains import GraphCypherQAChain
entity_centric_chain = GraphCypherQAChain.from_llm(graph=entity_centric_store, llm=llm, verbose=False)


In [26]:
QUESTION1 = "When was 'The Circle' released?"
QUESTION2 = "Where is Urup located?"

print("Entity Centric\n--------------")
start = perf_counter()
with get_openai_callback() as cb:
    print(f"Question 1: {QUESTION1}")
    print(entity_centric_chain.invoke(QUESTION1))
    print(f"\nQuestion 2: {QUESTION2}")
    print(entity_centric_chain.invoke(QUESTION2))

    end = perf_counter()
    print(f"Entity Centric Time in {end - start:0.2f}s")
    print(f"OpenAI stats: prompt tokens {cb.prompt_tokens}, completion tokens {cb.completion_tokens}, total cost {cb.total_cost}")

print("\nContent Centric\n---------------")
start = perf_counter()
with get_openai_callback() as cb:
    print(f"Question 1: {QUESTION1}")
    print(content_centric_chain.invoke(QUESTION1))
    print(f"\nQuestion 2: {QUESTION2}")
    print(content_centric_chain.invoke(QUESTION2))

    end = perf_counter()
    print(f"Content Centric Time in {end - start:0.2f}s")
    print(f"OpenAI stats: prompt tokens {cb.prompt_tokens}, completion tokens {cb.completion_tokens}, total cost {cb.total_cost}")

Entity Centric
--------------
Question 1: When was 'The Circle' released?




{'query': "When was 'The Circle' released?", 'result': "I don't know the answer."}

Question 2: Where is Urup located?




{'query': 'Where is Urup located?', 'result': "I don't know the answer."}
Entity Centric Time in 5.36s
OpenAI stats: prompt tokens 582, completion tokens 61, total cost 0.0

Content Centric
---------------
Question 1: When was 'The Circle' released?
The Circle was released in 1988.

Question 2: Where is Urup located?
Urup is located in Badakhshan Province in north-eastern Afghanistan.
Content Centric Time in 1.96s
OpenAI stats: prompt tokens 450, completion tokens 24, total cost 0.0


It may be surprising that the fine-grained Neo4j graph returns useless answers. The logging from the chain shows part of the reason:

```
> Entering new GraphCypherQAChain chain...
Generated Cypher:
cypher
MATCH (a:Album {id: 'The Circle'})-[:RELEASED_BY]->(r:Record_label)
RETURN a.id, r.id

Full Context:
[{'a.id': 'The Circle', 'r.id': 'Restless'}]

> Finished chain.
{'query': "When was 'The Circle' released?", 'result': "I don't know the answer."}
```

So, the fine-grained schema only returned information about the record label, which wasn't helpful for answering the question.

## Conclusion
Extracting fine-grained, entity-specific knowledge graphs is time- and cost-prohibitive at scale. When asked questions over the subset of data that was loaded, the additional granularity (and extra cost of loading the fine-grained graph) returned more tokens to include in the prompt, but generated useless answers!

`GraphVectorStore` takes a coarse-grained, content-centric approach that makes it fast and easy to build a knowledge graph. You can start with your existing code for populating a `VectorStore` using LangChain and add links (edges) between chunks to improve the retrieval process.

Graph RAG is a useful tool for enabling GenAI RAG applications to retrieve more deeply relevant context. But using a fine-grained, entity-centric approach does not scale to production needs. If you're looking to add knowledge graph capabilities to your RAG application, try `GraphVectorStore`.