# Environment / Dependencies

In [10]:
#@ Install modules
%pip install -U -r requirements.txt

Collecting langchain_experimental (from -r requirements.txt (line 3))
  Downloading langchain_experimental-0.3.2-py3-none-any.whl.metadata (1.7 kB)
INFO: pip is looking at multiple versions of langchain-experimental to determine which version is compatible with other requirements. This could take a while.
  Downloading langchain_experimental-0.3.1.post1-py3-none-any.whl.metadata (1.7 kB)
  Downloading langchain_experimental-0.3.1-py3-none-any.whl.metadata (1.7 kB)
  Downloading langchain_experimental-0.3.0-py3-none-any.whl.metadata (1.7 kB)
  Downloading langchain_experimental-0.0.65-py3-none-any.whl.metadata (1.7 kB)
  Downloading langchain_experimental-0.0.64-py3-none-any.whl.metadata (1.7 kB)
Downloading langchain_experimental-0.0.64-py3-none-any.whl (204 kB)
Installing collected packages: langchain_experimental
Successfully installed langchain_experimental-0.0.64
Note: you may need to restart the kernel to use updated packages.


In [2]:
#@ Configure import paths.
import sys
sys.path.append("../../")

# Initialize environment variables.
from utils import initialize_environment
initialize_environment()

# Data to Load
For this notebook, we'll work on loading the first 100 articles from Wikipedia. We use Wikipedia data from the [2wikimultihop](https://github.com/Alab-NII/2wikimultihop) dataset. To execute the rest of the notebook, you will need to download [para_with_hyperlink.zip](https://www.dropbox.com/s/wlhw26kik59wbh8/para_with_hyperlink.zip) to the `wikimultihop` directory.

In [3]:
from itertools import islice
from datasets.wikimultihop.load import wikipedia_lines

NUM_LINES_TO_LOAD = 100
lines_to_load = list(islice(wikipedia_lines(), NUM_LINES_TO_LOAD))

# Content-Centric: GraphVectorStore

In [4]:
#@ Create GraphVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_community.graph_vectorstores.cassandra import CassandraGraphVectorStore
import cassio

cassio.init(auto=True)
TABLE_NAME = "wiki_load"
store = CassandraGraphVectorStore(
    embedding=OpenAIEmbeddings(),
    node_table=TABLE_NAME,
    insert_timeout=1000.0,
)

In [6]:
#@ Empty the table (optional)
if input("clear data(y/N): ").lower() == "y":
    print("Clearing data...")
    from cassio.config import check_resolve_session, check_resolve_keyspace
    session = check_resolve_session()
    keyspace = check_resolve_keyspace()

    session.execute(f"TRUNCATE TABLE {keyspace}.{TABLE_NAME};")
    print("Done")
else:
    print("Skipped clearing data")

Clearing data...
Done


In [8]:
#@ Load Data Into GraphVectorStore
if input("load data (y/N): ").lower() == "y":
    print("Loading entity-centric data...")
    from time import perf_counter

    start = perf_counter()
    from datasets.wikimultihop.load import parse_document
    kg_documents = [parse_document(line) for line in lines_to_load]
    store.add_documents(kg_documents)
    end = perf_counter()
    print(f"Loaded (and written) {NUM_LINES_TO_LOAD} in {end - start:0.2f}s")
else:
    print("Skipped loading data")

Loading entity-centric data...
Loaded (and written) in 1.43s


When I run this, it takes about 1.43s to load these 100 documents. Under the hood, loading extracts the links from each Wikipedia page. I previously ran all 5,989,847 documents from the dump through this process asynchronously, and it took about 2.5 hours in total.
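
The asynchronous run mentioned above is not shown in this notebook. As a rough illustration of the general idea only, the sketch below writes batches concurrently with a fixed number of workers. The names `load_concurrently`, `write_batch`, `batch_size`, and `num_workers` are hypothetical and do not come from the original run; the batch size and worker count are arbitrary defaults, not tuned values.

```python
import asyncio
from itertools import islice
from typing import Awaitable, Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items, lazily."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


async def load_concurrently(
    items: Iterable[T],
    write_batch: Callable[[List[T]], Awaitable[object]],  # hypothetical writer
    batch_size: int = 50,
    num_workers: int = 8,
) -> int:
    """Write `items` in batches using `num_workers` concurrent workers.

    Returns the number of items written.
    """
    # All workers pull from one lazy generator, so at most `num_workers`
    # batches are held in memory at a time.
    batches = batched(items, batch_size)
    written = 0

    async def worker() -> None:
        nonlocal written
        for batch in batches:
            await write_batch(batch)
            written += len(batch)

    await asyncio.gather(*(worker() for _ in range(num_workers)))
    return written
```

With a store that implements LangChain's async `aadd_documents` method, a call might look like `await load_concurrently(kg_documents, store.aadd_documents)`. This sketch has no retries or error handling, which a run over the full dump would likely need.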

# Entity-Centric: LLMGraphTransformer

The following is based on LangChain's ["How to construct knowledge graphs"](https://python.langchain.com/docs/how_to/graph_constructing/#llm-graph-transformer). It uses `LLMGraphTransformer` to transform documents into knowledge graph nodes and edges.

In [12]:
from langchain_core.documents import Document
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0, model_name="gpt-4-turbo")
llm_transformer = LLMGraphTransformer(llm=llm)

from time import perf_counter
start = perf_counter()

documents_to_load = [Document(page_content=line) for line in lines_to_load]
graph_documents = llm_transformer.convert_to_graph_documents(documents_to_load)

end = perf_counter()
print(f"Loaded (but NOT written) {NUM_LINES_TO_LOAD} in {end - start:0.2f}s")

Loaded (but NOT written) in 1024.13s


In [None]:
from langchain_community.graphs.memgraph_graph import MemgraphGraph

from time import perf_counter
start = perf_counter()

graph_store = MemgraphGraph(url="", username="", password="")
graph_store.add_graph_documents(graph_documents)

end = perf_counter()
print(f"Written in {end - start:0.2f}s")

Just extracting the graph documents (without writing them to a graph store) took 1024.13s for 100 documents. Extrapolating to all 5,989,847 documents, this would be about 710 days, nearly 2 **years**! Of course, there are likely similar opportunities for parallelism. The content-centric load would take about 24 hours at the sequential rate measured above but finished in 2.5 hours with async, a speedup of roughly 9.5x. Assuming the same reduction here, the extraction would still take about 75 days. Even if that could be done in a fault-tolerant way (or no errors happened), the resulting graph documents would still need to be written to a graph store.
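
The extrapolation can be reproduced with a few lines of arithmetic. The sketch below uses only the timings reported in this notebook. It assumes per-document cost stays constant at scale and that the entity-centric pipeline would see the same speedup as the content-centric one; neither assumption has been measured.

```python
TOTAL_DOCS = 5_989_847   # documents in the full dump
SAMPLE_DOCS = 100        # documents timed in this notebook
SECONDS_PER_HOUR = 3_600
SECONDS_PER_DAY = 86_400


def extrapolate_seconds(sample_seconds: float) -> float:
    """Scale a timing for SAMPLE_DOCS documents linearly to TOTAL_DOCS."""
    return sample_seconds / SAMPLE_DOCS * TOTAL_DOCS


# Sequential estimates from the two measured timings.
content_sequential_s = extrapolate_seconds(1.43)
entity_sequential_s = extrapolate_seconds(1024.13)

# Observed async run for the content-centric load: about 2.5 hours.
content_async_s = 2.5 * SECONDS_PER_HOUR
speedup = content_sequential_s / content_async_s

# Assumption: the entity-centric extraction gets the same speedup.
entity_parallel_days = entity_sequential_s / speedup / SECONDS_PER_DAY

print(f"Content-centric, sequential: {content_sequential_s / SECONDS_PER_HOUR:.1f} hours")
print(f"Entity-centric, sequential:  {entity_sequential_s / SECONDS_PER_DAY:.1f} days")
print(f"Speedup from async:          {speedup:.1f}x")
print(f"Entity-centric, parallel:    {entity_parallel_days:.1f} days")
```

The entity-centric estimate covers only the LLM extraction step; writing the graph documents to a graph store would add to it.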

# Conclusion
In this short notebook, we saw how articles from a Wikipedia dump could be loaded into a `GraphVectorStore` in mere hours. The same content would take months of processing time and incur significant LLM costs to load into a knowledge graph.