##### LlamaIndex - Property Graph Index basic example
Docs: [Property Graph Index basic example](https://docs.llamaindex.ai/en/stable/examples/property_graph/property_graph_basic/)

##### Setup

In [3]:
### SETUP ###

# get keys
import os
import getpass
os.environ["OPENAI_API_KEY"] = getpass.getpass()

In [5]:
### SETUP ###

# get example unstructured data
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

--2024-05-30 18:35:48--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’


2024-05-30 18:35:48 (6.66 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]



In [6]:
### SETUP ###

# a notebook already runs an event loop; nest_asyncio patches asyncio so that
# LlamaIndex's async calls can run inside that loop
import nest_asyncio
nest_asyncio.apply()
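To see why the patch is needed, here is a minimal standard-library sketch of the failure it works around. It assumes a plain Python script, not a notebook cell, and `inner` and `outer` are hypothetical names used only for illustration. `outer` stands in for an environment such as Jupyter where a loop is already running, and the nested `asyncio.run()` stands in for library code that tries to start its own loop.

```python
import asyncio


async def inner():
    # Stand-in for async library work (hypothetical).
    return "done"


async def outer():
    # We are now inside a running event loop, as in a notebook cell.
    coro = inner()
    try:
        # Unpatched asyncio refuses to start a second loop here.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # tidy up the coroutine that was never awaited
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

With unpatched asyncio the printed message is the `RuntimeError` text. Once `nest_asyncio.apply()` has been called, nested calls like this one are allowed.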

In [7]:
### SETUP ###

# load the essay from disk into a list of Document objects
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

##### Construction

In [8]:
### Construction ###

from llama_index.core import PropertyGraphIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

index = PropertyGraphIndex.from_documents(
    documents,
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.3),
    embed_model=OpenAIEmbedding(model_name="text-embedding-3-small"),
    show_progress=True,
)

Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 31.95it/s]
Extracting paths from text: 100%|██████████| 22/22 [00:11<00:00,  1.85it/s]
Extracting implicit paths: 100%|██████████| 22/22 [00:00<00:00, 12798.15it/s]
Generating embeddings: 100%|██████████| 1/1 [00:00<00:00,  1.46it/s]
Generating embeddings: 100%|██████████| 5/5 [00:00<00:00,  6.16it/s]


Explanation
- `PropertyGraphIndex.from_documents()` - loads the documents into a new index
- `Parsing nodes` - the index splits the documents into text nodes (chunks)
- `Extracting paths from text` - each node is passed to the LLM, which is prompted to generate knowledge graph triples (i.e. paths)
- `Extracting implicit paths` - each node's `relationships` property is used to infer implicit paths
- `Generating embeddings` - embeddings are generated for the text nodes and for the graph nodes, which is why this step appears twice
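As a rough illustration of what a triple (path) is, the sketch below models one as a subject, a relation, and an object, and groups a few of them by subject. It is plain Python and does not use LlamaIndex. `Triple`, `format_path`, and `graph` are hypothetical names chosen for this example, and the sample triples are copied from the retriever output in the Querying section.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One knowledge graph path: subject -> relation -> object."""
    subject: str
    relation: str
    obj: str


def format_path(triple: Triple) -> str:
    # Same "A -> B -> C" layout the retriever prints.
    return f"{triple.subject} -> {triple.relation} -> {triple.obj}"


triples = [
    Triple("Yahoo", "Bought", "Viaweb"),
    Triple("Interleaf", "Made", "Software"),
    Triple("Interleaf", "Added", "Scripting language"),
]

# Group outgoing edges by subject to get a tiny adjacency-list graph.
graph: dict[str, list[tuple[str, str]]] = {}
for t in triples:
    graph.setdefault(t.subject, []).append((t.relation, t.obj))

paths = [format_path(t) for t in triples]
for path in paths:
    print(path)
```

Grouping by subject makes it easy to list everything the graph says about a single entity, for example `graph["Interleaf"]`.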

In [15]:
### Construction ###
# alternative: configure the path extractors explicitly via the lower-level API

from llama_index.core.indices.property_graph import (
    ImplicitPathExtractor,
    SimpleLLMPathExtractor,
)

index = PropertyGraphIndex.from_documents(
    documents,
    embed_model=OpenAIEmbedding(model_name="text-embedding-3-small"),
    kg_extractors=[
        ImplicitPathExtractor(),
        SimpleLLMPathExtractor(
            llm=OpenAI(model="gpt-3.5-turbo", temperature=0.3),
            num_workers=4,
            max_paths_per_chunk=10,
        ),
    ],
    show_progress=True,
)

Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 32.69it/s]
Extracting implicit paths: 100%|██████████| 22/22 [00:00<00:00, 26088.40it/s]
Extracting paths from text: 100%|██████████| 22/22 [00:13<00:00,  1.63it/s]
Generating embeddings: 100%|██████████| 1/1 [00:00<00:00,  1.54it/s]
Generating embeddings: 100%|██████████| 6/6 [00:00<00:00, 10.51it/s]


In [16]:
# write the graph to an HTML file for visual inspection
index.property_graph_store.save_networkx_graph(name="./kg.html")

##### Querying

In [17]:
retriever = index.as_retriever(
    include_text=False,  # return only the graph triples, not the source text (default: True)
)

nodes = retriever.retrieve("What happened at Interleaf and Viaweb?")

for node in nodes:
    print(node.text)

Viaweb -> Was -> Lucky for us
Viaweb -> Charged -> $100 a month for a small store
Interleaf -> Had -> Few years to live
Interleaf -> Made -> Software
Interleaf -> Added -> Scripting language
We -> Called -> Viaweb
Viaweb -> Was -> Poor
Viaweb -> Was -> Profitable
Interleaf -> Had -> Group
Viaweb -> Was -> Growing rapidly
Viaweb -> Charged -> $300 a month for a big one
Interleaf -> Wanted -> Lisp hacker
Yahoo -> Bought -> Viaweb
Viaweb -> Was -> Inexpensive
Viaweb -> Started by -> Someone
Philz -> Founded by -> Someone
Y combinator -> Started by -> Someone
Group -> Seemed -> Big
Software -> Tends to eat -> High end software
Software -> Was -> Quite fun to work on
Group -> Called -> Release engineering
Software -> Working -> Way
Technology companies -> Be run by -> Sales people
Textile department -> Belonged to -> Neighbor
Computers -> Weren't powerful enough -> To run a more complicated interpreter
Painting department -> Seemed -> Rigorous
Illustration -> Seemed -> Rigorous
Architecture

In [18]:
query_engine = index.as_query_engine(
    include_text=True,
)

response = query_engine.query("What happened at Interleaf and Viaweb?")

print(str(response))

Interleaf made software for creating documents and added a scripting language based on Lisp. They were looking for a Lisp hacker to work with this language. Viaweb was a company that developed an online store builder software. They struggled financially initially but eventually became profitable and were acquired by Yahoo. The founders of Viaweb later started Y Combinator.


##### Storage

In [19]:
# save the index to disk
index.storage_context.persist(persist_dir="./storage")

from llama_index.core import StorageContext, load_index_from_storage

# reload the index from the persisted files
index = load_index_from_storage(
    StorageContext.from_defaults(persist_dir="./storage")
)
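One common way to use persistence is a load-or-build pattern, so the LLM extraction step is not repeated on every run. The sketch below is a generic standard-library illustration, not part of the LlamaIndex API. `load_or_build`, `fake_build`, and `fake_load` are hypothetical names. In the notebook, the build callable would wrap `PropertyGraphIndex.from_documents(...)` followed by `persist(...)`, and the load callable would wrap `load_index_from_storage(...)`.

```python
import os
import tempfile


def load_or_build(persist_dir, build, load):
    """Call load(persist_dir) if the directory holds files, else build(persist_dir)."""
    if os.path.isdir(persist_dir) and os.listdir(persist_dir):
        return load(persist_dir)
    return build(persist_dir)


def fake_build(persist_dir):
    # Hypothetical stand-in for building and persisting an index.
    os.makedirs(persist_dir, exist_ok=True)
    with open(os.path.join(persist_dir, "marker.txt"), "w") as fh:
        fh.write("persisted")
    return "built"


def fake_load(persist_dir):
    # Hypothetical stand-in for loading a persisted index.
    return "loaded"


with tempfile.TemporaryDirectory() as tmp:
    storage = os.path.join(tmp, "storage")
    first = load_or_build(storage, fake_build, fake_load)   # nothing on disk yet
    second = load_or_build(storage, fake_build, fake_load)  # files now exist

print(first, second)
```

The check treats an empty directory the same as a missing one, so a failed earlier run that created the folder but wrote nothing still triggers a rebuild.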