# GraphRAG: Embedding with Properties and Selectors in AllegroGraph

This notebook demonstrates how to use AllegroGraph's vector store capabilities to implement semantic search and retrieval. It shows how to:

* Set up and authenticate a connection to an AllegroGraph repository
* Transform your repository into a powerful vector store by integrating it with modern embedding models
* Optimize vector search performance by enriching embeddings with custom properties
* Create mappings between embeddings and their source objects for seamless data retrieval

The techniques shown here enable you to build sophisticated GraphRAG applications that combine the semantic power of vector search with AllegroGraph's native graph capabilities.

Prerequisites:

* AllegroGraph version `8.3.0` or higher
* A running AllegroGraph instance (local, cloud, or Docker)

To begin, you'll need to configure your connection parameters in the `ag_connect` statement below.

In [6]:
from franz.openrdf.connect import ag_connect
from franz.openrdf.query.query import QueryLanguage
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema.document import Document
from tqdm import tqdm
import tarfile

conn = ag_connect('chomsky-embedding', user='', password='', host='', port='')  # repository name must match the vector store name used in later queries
conn.setNamespace('c', 'http://chomsky.com/')
c = conn.namespace('http://chomsky.com/')

with tarfile.open('chomsky.nq.tgz', 'r:gz') as tar:
    # Extract all contents to the current directory
    tar.extractall()
    
if conn.size() == 0:
    conn.addFile('chomsky.nq')
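Rather than typing credentials into the notebook, the connection parameters could be read from the environment. The following is a minimal sketch, not part of the original notebook: the helper name `connection_settings`, the environment variable names, and the fallback values (`localhost`, port `10035`) are all assumptions to adjust for your deployment.

```python
import os


def connection_settings(env=None):
    """Collect keyword arguments for ag_connect from environment variables.

    The variable names and fallback values are assumptions made for this
    sketch; change them to match your own setup.
    """
    env = os.environ if env is None else env
    return {
        "host": env.get("AGRAPH_HOST", "localhost"),
        "port": int(env.get("AGRAPH_PORT", "10035")),
        "user": env.get("AGRAPH_USER", ""),
        "password": env.get("AGRAPH_PASSWORD", ""),
    }


# Demonstration with an explicit mapping instead of the real environment.
settings = connection_settings({"AGRAPH_HOST": "example.org", "AGRAPH_PORT": "10035"})
print(settings["host"], settings["port"])

# With a running server, the settings could then be unpacked into ag_connect:
# conn = ag_connect('<repository name>', **connection_settings())
```

Keeping credentials out of the notebook also makes it safer to share or commit.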

We also define a helper function that splits long text into overlapping chunks:

In [48]:
def split_text(text: str) -> list:
    documents = [Document(page_content=text)]
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=100)
    docs = text_splitter.split_documents(documents)
    return docs
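To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified, dependency-free sketch of overlapping chunking. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers natural separators such as paragraph breaks and treats `chunk_size` as an upper bound, whereas this hypothetical `sliding_window_chunks` helper slices at fixed character offsets purely for illustration.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-size chunks where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample_text = "".join(str(i % 10) for i in range(2500))
sample_chunks = sliding_window_chunks(sample_text)
print([len(chunk) for chunk in sample_chunks])
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence that straddles a boundary still appears whole in at least one embedding.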

## Converting a Repository to a Vector Store

We'll now enhance the `chomsky-embedding` repository with vector store capabilities. This transformation preserves all existing triples while enabling the storage and retrieval of embeddings. While AllegroGraph supports creating dedicated vector stores, combining embeddings with triples in a single repository offers several advantages:

* Unified data management
* Seamless integration between semantic and vector-based queries
* Simplified maintenance and backup procedures
* Reduced operational complexity

The following code configures the embedding model using OpenAI. The process is not tied to one provider: you can adapt it to other embedding providers that AllegroGraph supports.

In [44]:
API_KEY=''
EMBEDDER='openai' 
EMBEDDING_MODEL='text-embedding-3-small'

You only have to run the following once to convert the regular repository to a vector store.

In [50]:
conn.convert_to_vector_store(
    EMBEDDER,
    api_key=API_KEY,
    model=EMBEDDING_MODEL)

# Embedding text with properties

In this section, we'll create semantically rich embeddings from Chomsky's works while preserving crucial metadata. Our process involves three key steps:

1. **Query Extraction**: We'll retrieve all textual content from Chomsky's articles and books stored in our triple store.

2. **Text Processing**: For optimal embedding quality and search performance, we'll chunk longer texts (>1000 characters) into more manageable segments. This ensures each embedding captures a coherent unit of meaning.

3. **Enhanced Embedding Creation**: Using the `add_objects` method, we'll generate embeddings for each text segment. Crucially, we'll enrich these embeddings with additional properties that enable:
    * Precise filtering during vector searches
    * Contextual metadata preservation
    * Efficient narrowing of the search space for `nearest_neighbor` queries
    * More targeted responses in `askMyDocuments` operations



By attaching properties to our embeddings, we create a more sophisticated search infrastructure that can leverage both semantic similarity and metadata filters to deliver more accurate results.

In [51]:
#books
query_string = """
SELECT ?name ?paragraph ?text WHERE {
    ?book a c:Book ;
            rdfs:label ?name ;
            c:hasParagraph ?paragraph .
      ?paragraph c:hasContent ?text . }
limit 1000"""
result = conn.prepareTupleQuery(QueryLanguage.SPARQL, query_string).evaluate()

triples = []
with result:
    for binding_set in tqdm(result):
        name = binding_set.getValue('name').label # the book name
        paragraph = binding_set.getValue('paragraph') # the paragraph URI
        text = binding_set.getValue('text').label # the actual text of the paragraph
        
        chunks = split_text(text) # splitting large paragraphs into chunks of 1000 characters
        
        for chunk in chunks:                  # for each chunk in the list of chunks
            vector_id = conn.add_objects(     # the method used to embed text
                chunk.page_content,           # grab the content of the chunk
                properties = {                # declare the properties per embedding
                    "documentType": "Book",   # for this query values are "Book" 
                    "name": name              # the name of the book the text is from
                }
            )
            vector_uri = conn.createURI(vector_id)   # convert the vector_id returned from the embedding add_objects function
            triples.append((paragraph, c.vectorId, vector_uri)) # connect the paragraph where the text originated to the vector ID
        
conn.addTriples(triples) 
        

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [10:34<00:00,  1.57it/s]


In [52]:
# Now we add articles (similar process to above)

query_string = """
SELECT ?name ?article ?text WHERE {
    ?article a c:NewsArticle ;
             rdfs:label ?name ;
             c:hasJournalist c:Noam_Chomsky ;
             c:hasContent ?text . }"""
result = conn.prepareTupleQuery(QueryLanguage.SPARQL, query_string).evaluate()

triples = []
with result:
    for binding_set in tqdm(result):
        name = binding_set.getValue('name').label
        article = binding_set.getValue('article')
        text = binding_set.getValue('text').label
        
        documents = split_text(text)
        
        for doc in documents:
            vector_id = conn.add_objects(
                doc.page_content,
                properties = {
                    "documentType": "Article",
                    "name": name
                }
            )
            vector_uri = conn.createURI(vector_id)
            triples.append((article, c.vectorId, vector_uri))
        
conn.addTriples(triples) 

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:40<00:00,  3.72s/it]


## Nearest Neighbor Search

This section demonstrates how to perform high-precision semantic searches across your embedded texts using [AllegroGraph's Nearest Neighbor](https://franz.com/agraph/support/documentation/ns-allegrograph-8.0.0-llm-nearestNeighbor.html) functionality.

### Core Search Capabilities
The `nearest_neighbor` method finds the most semantically similar texts to your query by comparing vector embeddings in high-dimensional space. Here's the basic syntax:

```python
result = conn.nearest_neighbor(
    query_text,     # The text to search for
    minScore=0.0,   # Minimum similarity threshold (0.0-1.0)
    topN=5,         # Number of results to return
    selector=None   # Optional property-based filtering
)
```

The method returns a list of matches, where each match contains:

* `vector_id`: Unique identifier for the embedding
* `score`: Similarity score (higher means more similar)
* `text`: The matching text content
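Assuming each match is a three-element list in the order listed above, the raw result can be turned into dictionaries that are easier to work with. This is a sketch only: `matches_to_dicts` is a hypothetical helper, and the sample row is shortened illustrative data rather than a live query result.

```python
def matches_to_dicts(matches):
    """Convert [vector_id, score, text] rows into dictionaries with named fields."""
    rows = []
    for vector_id, score, text in matches:
        rows.append({
            # Vector IDs may arrive wrapped in angle brackets; strip them if present.
            "vector_id": vector_id.strip("<>"),
            "score": float(score),
            "text": text,
        })
    return rows


# Shortened sample row for illustration only.
sample_matches = [["<http://franz.com/vdb/id/1279>", 0.6333528, "Though it is natural ..."]]
for row in matches_to_dicts(sample_matches):
    print(row["vector_id"], row["score"])
```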

### Advanced Filtering with Selectors
The selector parameter enables powerful filtering based on the properties we attached during the embedding process. This allows you to:

* Narrow searches to specific document types, time periods, or authors
* Combine semantic similarity with metadata constraints
* Improve search performance by reducing the candidate pool

The properties available for filtering are the ones defined in the embedding creation code above (`documentType` and `name`).

In [4]:
from pprint import pprint

result = conn.nearest_neighbor("Democratic culture and organizing principles based on the unending quest for justice and freedom")
pprint(result)

[['<http://franz.com/vdb/id/1279>',
  0.6333528,
  'Though it is natural for doctrinal systems to seek to induce pessimism, '
  'hopelessness, and despair, reality is different. There has been substantial '
  'progress in the unending quest for justice and freedom in recent years, '
  'leaving a legacy that can be carried forward from a higher plane than '
  'before. Opportunities for education and organizing abound. As in the past, '
  'rights are not likely to be granted by benevolent authorities, or won by '
  'intermittent actionsattending a few demonstrations or pushing a lever in '
  'the personalized quadrennial extravaganzas that are depicted as "democratic '
  'politics." As always in the past, the tasks require dedicated day-by-day '
  'engagement to createin part re-createthe basis for a functioning democratic '
  'culture in which the public plays some role in determining policies, not '
  'only in the political arena, from which it is largely excluded, but also in '
  'the

## Limiting the Scope of Vector Search

While semantic similarity is powerful, you can make your searches more precise by filtering based on document properties. Here's how to narrow your search to specific document types, like Chomsky's articles.

### Selector Syntax
The selector parameter accepts a SPARQL-like query pattern that filters embeddings based on their properties. 

* Omit the `SELECT` statement: the selector is a pattern, not a complete query
* Always use `?id` as the subject variable
* Construct property predicates by combining the base URI <http://franz.com/vdb/prop/> with your property keys
* Maintain the curly braces `{}` around your pattern

These property predicates correspond directly to the keys we defined earlier in the properties dictionary when creating our embeddings.
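The rules above can be captured in a small helper that builds a selector string from a dictionary of property keys and values. This is a sketch under stated assumptions: `build_selector` is a hypothetical name, it only handles plain string values, and it combines multiple properties with `;` so that all of them must match.

```python
PROP_BASE = "http://franz.com/vdb/prop/"  # base URI for embedding properties


def build_selector(properties):
    """Build a selector pattern such as { ?id <.../documentType> "Book" . }."""
    if not properties:
        raise ValueError("at least one property is required")
    parts = []
    for key, value in properties.items():
        # Escape backslashes and double quotes so the value stays a valid literal.
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'<{PROP_BASE}{key}> "{escaped}"')
    return "{ ?id " + " ; ".join(parts) + " . }"


print(build_selector({"documentType": "Article"}))
print(build_selector({"documentType": "Book", "name": "Failed States"}))
```

Building the pattern from the same keys used in the `properties` dictionary helps avoid typos in the predicate URIs.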

In [6]:
selector = """{ ?id <http://franz.com/vdb/prop/documentType> "Article"}"""

result = conn.nearest_neighbor(
    "Democratic culture and organizing principles based on the unending quest for justice and freedom",
    selector=selector
)
pprint(result)

[['<http://franz.com/vdb/id/1379>',
  0.4999426,
  'These range from very general principles, such as the truism that we should '
  'apply to ourselves the same standards we do to others (if not harsher '
  'ones), to more specific doctrines, such as a dedication to promoting '
  'democracy and human rights, which is proclaimed almost universally, even by '
  'the worst monsters – though the actual record is grim, across the '
  'spectrum.\n'
  '\n'
  'A good place to start is with John Stuart Mill’s classic “On Liberty.” Its '
  'epigraph formulates “The grand, leading principle, towards which every '
  'argument unfolded in these pages directly converges: the absolute and '
  'essential importance of human development in its richest diversity.”\n'
  '\n'
  'The words are quoted from Wilhelm von Humboldt, a founder of classical '
  'liberalism. It follows that institutions that constrain such development '
  'are illegitimate, unless they can somehow justify themselves.\n'
  '\n'
  'C

Notice that the results differ from the unfiltered search.

Because we also stored each book's name as a property, we can restrict the search not only to books but to one specific book.

In [7]:
selector = """{ ?id <http://franz.com/vdb/prop/documentType> "Book" ;
                    <http://franz.com/vdb/prop/name> "Failed States" . }"""

result = conn.nearest_neighbor("Political systems and activism", selector=selector)
pprint(result)

[['<http://franz.com/vdb/id/1279>',
  0.5564257,
  'Though it is natural for doctrinal systems to seek to induce pessimism, '
  'hopelessness, and despair, reality is different. There has been substantial '
  'progress in the unending quest for justice and freedom in recent years, '
  'leaving a legacy that can be carried forward from a higher plane than '
  'before. Opportunities for education and organizing abound. As in the past, '
  'rights are not likely to be granted by benevolent authorities, or won by '
  'intermittent actionsattending a few demonstrations or pushing a lever in '
  'the personalized quadrennial extravaganzas that are depicted as "democratic '
  'politics." As always in the past, the tasks require dedicated day-by-day '
  'engagement to createin part re-createthe basis for a functioning democratic '
  'culture in which the public plays some role in determining policies, not '
  'only in the political arena, from which it is largely excluded, but also in '
  'the

# Ask my Documents with GraphRAG

AllegroGraph's `askMyDocuments` predicate combines vector similarity search with Large Language Model (LLM) capabilities to provide context-aware answers from your document collection. This feature:

* Retrieves the most relevant document segments using semantic search
* Uses an LLM to synthesize a coherent answer
* Provides transparency by returning the source text used to generate the response

### Core Query Structure
The `askMyDocuments` magic predicate is implemented through a SPARQL query that orchestrates the GraphRAG process:

```
PREFIX llm: <http://franz.com/ns/allegrograph/8.0.0/llm/> 
PREFIX prop: <http://franz.com/vdb/prop/> # Prefix with which embedding properties are stored
PREFIX franzOption_openaiApiKey: <franz:{api_key}>  # add your API key, can also be set in config file or AGWebView

SELECT ?response ?score ?citation ?content WHERE {
    ( ?response ?score ?citation ?content )
        llm:askMyDocuments
    ( "your question"            # the question to answer
      "vector-store"             # the name of the vector store. In this case the repo and vector store are the same!
      topN                       # the number of results
      minScore                   # the minimum matching score
      keyword:selector "{}" ) }  # Can add selector pattern in {}
```

The query returns:

* `?response`: The LLM-generated answer
* `?score`: Similarity score of the cited text segment for the question
* `?citation`: Reference to the source document
* `?content`: Original text used to generate the answer

We'll wrap this functionality in a Python function for easier use.

In [88]:
def askMyDocuments(conn, api_key, question, n=5, min_score=0, selector=None):
    if selector:
        selector = f"""keyword:selector '{selector}' """
    else:
        selector=''
    query_string = f"""
        PREFIX llm: <http://franz.com/ns/allegrograph/8.0.0/llm/> 
        PREFIX prop: <http://franz.com/vdb/prop/>
        PREFIX franzOption_openaiApiKey: <franz:{api_key}> 
        
        SELECT ?response ?score ?citation ?content WHERE {{
            ( ?response ?score ?citation ?content )
                llm:askMyDocuments
            ( "{question}" "chomsky-embedding" {str(n)} {str(min_score)} {selector} ) . }}"""
    result = conn.prepareTupleQuery(QueryLanguage.SPARQL, query_string).evaluate()
    results = []
    with result:
        for binding_set in result:
            results.append(
                {
                    "response": binding_set.getValue('response').label,
                    "score": binding_set.getValue('score').floatValue(),
                    "citation": binding_set.getValue('citation'),
                    "content": binding_set.getValue('content').label
                }
            )
    return results
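The function above interpolates `question` directly into a double-quoted SPARQL literal, so a question containing a double quote, backslash, or line break would yield a malformed query. A minimal sketch of an escaping helper follows; `escape_sparql_literal` is a hypothetical name, and it covers only the common escape sequences.

```python
def escape_sparql_literal(text):
    """Escape characters that would break a double-quoted SPARQL string literal."""
    replacements = [
        ("\\", "\\\\"),  # backslash first, so later escapes are not doubled
        ('"', '\\"'),
        ("\n", "\\n"),
        ("\r", "\\r"),
        ("\t", "\\t"),
    ]
    for old, new in replacements:
        text = text.replace(old, new)
    return text


question = 'What does Chomsky mean by "manufacturing consent"?'
print(escape_sparql_literal(question))

# Inside the query template, the question could then be inserted as:
# ( "{escape_sparql_literal(question)}" ... )
```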

In [89]:
test = askMyDocuments(conn, API_KEY, "what does Chomsky think of anarchism", selector='{ ?id prop:documentType "Book" }')
test

[{'response': "The discussion revolves around authority structures and their lack of self-justification, with a call for dismantling illegitimate ones to expand freedom, aligning with libertarian and anarchist principles. There's also an examination of how 'anti-American' is a term used uniquely in the US to label critics, reflecting a totalitarian notion more common in non-free societies. Meanwhile, leftist radicalism is seen as growing but remains a minor concern. Critics across societies face vilification, but the US, despite its flaws, offers substantial freedom. Lastly, debates on wars, notably by Walzer, illustrate polarized views with some left critics seen as unjustly accused, highlighting challenges in intellectual discourse.",
  'score': 0.47449344,
  'citation': <http://franz.com/vdb/id/99>,
  'content': "Where there are structures of authority, domination, and hierarchysomebody gives the orders and somebody takes themthey are not self-justifying. They have to justify themse

# Automated Concept Extraction Using LLMs and Vector Search
Traditional entity extraction often requires complex NLP pipelines or custom extractors. We'll demonstrate a more elegant approach that combines LLM capabilities with vector search to automatically identify and link concepts from your text to a structured taxonomy.

## Overview of the Process

Our approach uses two key components:

* `llm:response` predicate: Leverages LLM capabilities to identify significant terms within text
* `nearest_neighbor` search: Maps identified terms to formal concepts in our taxonomy

## The Chomsky Taxonomy

Our repository includes a comprehensive taxonomy covering:

* People: Historical figures, philosophers, politicians
* Places: Countries, regions, institutions
* Concepts: Political theories, linguistic terms, social movements

We'll focus on leaf concepts (those without narrower terms) and create embeddings that capture both:

* `skos:prefLabel`: Standard terms
* `skos:altLabel`: Synonyms and variants

Each embedding will be enriched with:

* Concept categorization as a property
* Links to the original taxonomy concept

This structure enables:

* Precise concept extraction from text
* Hierarchical navigation of extracted concepts
* Relationship discovery through taxonomy connections

The resulting system provides a powerful way to automatically annotate text with structured conceptual metadata, enabling sophisticated semantic analysis and retrieval. 

In [57]:
query_string = """
SELECT ?concept ?label ?broader ?broaderPref WHERE {
  ?broader skos:narrower ?concept ;
           skos:prefLabel ?broaderPref .
  ?concept skos:prefLabel|skos:altLabel ?label .

  FILTER NOT EXISTS { ?concept skos:narrower ?narrower . }
}
"""
result = conn.prepareTupleQuery(QueryLanguage.SPARQL, query_string).evaluate()

triples = []
with result:
    for binding_set in tqdm(result):
        concept = binding_set.getValue('concept')
        label = binding_set.getValue('label').label
        broader = binding_set.getValue('broader')
        broaderPref = binding_set.getValue('broaderPref').label
        
        vector_id = conn.add_objects(
            label,
            properties = {
                "type": "entity_extractor",
                "broaderPref": broaderPref
            }
        )
        vector_uri = conn.createURI(vector_id)
        
        triples.append((concept, c.vectorId, vector_uri))
        
conn.addTriples(triples)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1339/1339 [23:24<00:00,  1.05s/it]


## Two-Stage Concept Extraction Pipeline

Let's break down our intelligent concept extraction process that combines LLM entity recognition with vector-based concept mapping.

### Process Overview

Our SPARQL query orchestrates a sophisticated pipeline:

#### Stage 1: LLM Entity Recognition

First, we use `llm:response` to leverage LLM capabilities for identifying key entities:

* The LLM analyzes the input text
* Extracts mentions of people, places, and concepts
* Returns these as a structured list in the `?response` variable

#### Stage 2: Concept Mapping

Then, we take each extracted entity and map it to our taxonomy:

* Use `nearest_neighbor` to find the closest matching concept embeddings
* Leverage our pre-established vector ID to concept URI mappings
* Retrieve the formal preferred labels and concept metadata

#### Benefits of This Approach

This two-stage process combines:

* The flexibility and contextual understanding of LLMs for entity recognition
* The precision and structure of our embedded taxonomy for standardization
* The power of vector similarity for handling variations and synonyms

The result is a robust system that can identify and standardize concepts even when they appear in different forms or contexts within the text.

In [92]:
def extract_concepts(conn, text, n=3, min_score=.5, broader=None):
    if broader:
        selector = f"?id prop:broaderPref '{broader}' . "
    else:
        selector = ""
    query_string = f"""
        PREFIX llm: <http://franz.com/ns/allegrograph/8.0.0/llm/> 
        PREFIX prop: <http://franz.com/vdb/prop/>
        PREFIX franzOption_openaiApiKey: <franz:{API_KEY}>  

        SELECT DISTINCT ?concept ?pref ?score WHERE {{
          
          {{SELECT ?response WHERE {{
              ?response llm:response '''Extract the important people, places, and concepts from the following text: {text}''' .
          }} }}
          
          (?vector_id ?score) llm:nearestNeighbor (?response "chomsky-embedding" {str(n)} {str(min_score)} 
                    keyword:selector "{{ ?id prop:type 'entity_extractor' . {selector} }}" ) . 
          ?concept c:vectorId ?vector_id ;
                  skos:prefLabel ?pref . }}"""
    results = []
    result = conn.prepareTupleQuery(QueryLanguage.SPARQL, query_string).evaluate()
    with result:
        for binding_set in result:
            results.append(
                {
                    "concept": binding_set.getValue('concept'),
                    "pref": binding_set.getValue('pref').label,
                    "score": binding_set.getValue('score').floatValue()
                }
            )
    return results
                    

In [62]:
results = extract_concepts(conn, "Barack Obama was elected in 2008 in the United States", min_score=.95)
for result in results:
    print(result)
    print()

{'concept': <https://allegrograph.poolparty.biz/MLUPCM/1624>, 'pref': 'Barack Obama', 'score': 0.9997594}

{'concept': <https://allegrograph.poolparty.biz/MLUPCM/234>, 'pref': 'United States', 'score': 1.0000001}



## Now with a broader label specified

Because each embedding also stores the preferred label of its broader concept as a property, we can restrict extraction to one branch of the taxonomy. In this case, we only want to extract concepts under "President of the United States".

In [69]:
results = extract_concepts(conn, "Obama is from the United States, Angela Merkel is from Germany", broader="President of the United States", min_score=.90)
for result in results:
    print(result)
    print()

{'concept': <https://allegrograph.poolparty.biz/MLUPCM/1624>, 'pref': 'Barack Obama', 'score': 0.9997617}



Now one large chunk of text as an example!

In [98]:
text = """Reagan's successor, George H. W. Bush, went on to provide further aid to Saddam,
badly needed after the war with Iran that he had launched. Bush also invited Iraqi nuclear engineers
to come to the United States for advanced training in weapons production. In April 1990, he dispatched a
high-level Senate delegation, led by future Republican presidential candidate Bob Dole, to convey his 
warm regards to his friend Saddam and to assure him that he should disregard irresponsible criticism from the
"haughty and pampered press," and that such miscreants had been removed from Voice of America.13 The fawning
before Saddam continued until he suddenly turned into a new Hitler a few months later when he disobeyed orders,
or perhaps misunderstood them, and invaded Kuwait, with illuminating consequences that I must leave aside here."""
results = extract_concepts(conn, text, min_score=.90)
for result in results:
    print(result)
    print()

{'concept': <https://allegrograph.poolparty.biz/MLUPCM/1872>, 'pref': 'George H.W. Bush', 'score': 0.936558}

{'concept': <https://allegrograph.poolparty.biz/MLUPCM/1871>, 'pref': 'Ronald Reagan', 'score': 0.9999998}

{'concept': <https://allegrograph.poolparty.biz/MLUPCM/2020>, 'pref': 'Saddam Hussein', 'score': 1.0000002}

{'concept': <https://allegrograph.poolparty.biz/MLUPCM/121>, 'pref': 'Iran', 'score': 1.0}

{'concept': <https://allegrograph.poolparty.biz/MLUPCM/135>, 'pref': 'Kuwait', 'score': 0.9999997}
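Because the query selects distinct combinations of concept, label, and score, a concept whose preferred and alternative labels both match may appear more than once with different scores. The sketch below keeps only the best-scoring row per concept and sorts the rest; `best_per_concept` is a hypothetical helper, concepts are represented as plain strings, and the duplicate row in the sample data is invented for illustration.

```python
def best_per_concept(results):
    """Keep the highest-scoring row for each concept, sorted by descending score."""
    best = {}
    for row in results:
        key = str(row["concept"])
        if key not in best or row["score"] > best[key]["score"]:
            best[key] = row
    return sorted(best.values(), key=lambda row: row["score"], reverse=True)


sample_results = [
    {"concept": "https://allegrograph.poolparty.biz/MLUPCM/1872", "pref": "George H.W. Bush", "score": 0.936558},
    # Hypothetical duplicate of the same concept with a lower score.
    {"concept": "https://allegrograph.poolparty.biz/MLUPCM/1872", "pref": "George H.W. Bush", "score": 0.91},
    {"concept": "https://allegrograph.poolparty.biz/MLUPCM/121", "pref": "Iran", "score": 1.0},
]
for row in best_per_concept(sample_results):
    print(row["pref"], row["score"])
```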



# Some more Vector Store Utility Functions

We start with a nearest neighbor search to obtain a vector ID:

In [20]:
result = conn.nearest_neighbor("Political systems and activism")
vector_id = result[0][0][1:-1]
print(vector_id)

http://franz.com/vdb/id/1279


1. `conn.object_property_value` - returns the value of the named property for the given vector ID

In [38]:
result = conn.object_property_value(
    vector_id,
    "documentType"
)
print(result)

"Book"


2. `conn.object_text` - returns the text stored for a given vector ID. Because we linked each paragraph to its vector IDs, we could also retrieve the source text with a SPARQL query, but the vector store provides the embedded text directly.

In [31]:
text = conn.object_text(vector_id)
print(text)

'Though it is natural for doctrinal systems to seek to induce pessimism, hopelessness, and despair, reality is different. There has been substantial progress in the unending quest for justice and freedom in recent years, leaving a legacy that can be carried forward from a higher plane than before. Opportunities for education and organizing abound. As in the past, rights are not likely to be granted by benevolent authorities, or won by intermittent actionsattending a few demonstrations or pushing a lever in the personalized quadrennial extravaganzas that are depicted as "democratic politics." As always in the past, the tasks require dedicated day-by-day engagement to createin part re-createthe basis for a functioning democratic culture in which the public plays some role in determining policies, not only in the political arena, from which it is largely excluded, but also in the crucial economic arena, from which it is excluded in principle. There are many ways to promote democracy at'
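The SPARQL route mentioned above could look roughly like the following sketch. The helper name `source_lookup_query` is hypothetical; the predicates `c:vectorId` and `c:hasContent` are the ones used earlier in this notebook. Note that this query returns the full source paragraph or article, whereas `conn.object_text` returns only the chunk that was embedded.

```python
def source_lookup_query(vector_id):
    """Build a SPARQL query that finds the source resource linked to a vector ID."""
    vector_uri = vector_id.strip("<>")  # accept IDs with or without angle brackets
    return f"""
PREFIX c: <http://chomsky.com/>
SELECT ?source ?text WHERE {{
    ?source c:vectorId <{vector_uri}> ;
            c:hasContent ?text .
}}"""


print(source_lookup_query("<http://franz.com/vdb/id/1279>"))

# With an open connection, the query could be run as in the earlier cells:
# conn.prepareTupleQuery(QueryLanguage.SPARQL, source_lookup_query(vector_id)).evaluate()
```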

3. `conn.object_embedding` - We can get the actual embedding of a vector. It is a list of floating point numbers

In [33]:
embedding = conn.object_embedding(vector_id)
print(embedding)

[0.038543507, 0.039096102, 0.06697452, 0.06520621, 0.06691926, -0.0071906435, 0.04254982, 0.05912767, -0.0085514086, 0.04492598, 0.019755274, -0.01518255, -0.027643567, 0.0029494762, 0.035393715, 0.015375958, -0.011963683, 0.038101432, 0.04044996, 0.03823958, 0.011044995, -0.0113074765, -0.005671007, 0.029011244, -0.006731299, 0.023374772, -0.036498904, 0.05940397, -0.008157685, 0.03158081, 0.03572527, -0.026828492, -0.026068673, 0.02225577, 0.026897568, 0.06885335, -0.019948682, -0.01353858, 0.042909008, -0.0108308636, 0.0234024, -0.072224176, -0.017917897, 0.033210963, 0.022504434, -0.027201492, -0.01679889, -0.025225967, -0.00120103076, 0.021426873, -0.038709283, 0.0002702535, 0.042356414, -0.019561866, -0.044124714, -0.018926382, -0.054596394, -0.03287941, -0.08449179, -0.0037265632, 0.0086135757, -0.0236787, 0.01917505, -0.006320306, 0.005719359, 0.021357799, -0.026331156, -0.001844286, 0.020073016, 0.005995656, 0.033929336, 0.0015360415, 0.060785455, 0.0120603874, -0.020155904, 0
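Two such lists of floats can be compared directly. The notebook does not state which similarity metric AllegroGraph uses for its scores; cosine similarity is a common choice for comparing embeddings and is used here only as an illustration, with a hypothetical `cosine_similarity` helper and toy vectors.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors for illustration; real embeddings have many more dimensions.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Vectors pointing in the same direction score close to 1, while unrelated (orthogonal) vectors score close to 0.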

In [45]:
conn.size()

285274