# Natural Language to SPARQL Query Conversion in AllegroGraph

Learn how to transform natural language questions into precise SPARQL queries using AllegroGraph's intelligent query conversion system. This system combines SHACL shapes with a vector store of successful query pairs to continuously improve its translation accuracy.

## Core Capabilities

This notebook demonstrates how to:

* Execute natural language queries and examine their SPARQL translations and results
* Enhance future query accuracy by storing successful natural language-to-SPARQL mappings
* Access and manage your database of historical query conversions
* Curate your query history by removing specific entries

## Technical Foundation

The system leverages:

* SHACL shapes automatically derived from your repository structure
* Vector embeddings of previously successful query pairs, retrieved as examples to guide new translations


## System Requirements

* AllegroGraph version **8.3.0** or higher
* `agraph-python` client version **104.2.0** or higher
* OpenAI API key for embedding and query generation

## Important Version Note

For AllegroGraph versions prior to **8.4.0**, the natural language query vector database must be initialized through either:

* AGWebView interface ([documentation](https://gruff.allegrograph.com/classic-webview/doc/natural-language-sparql-queries.html))
* Manual setup process (demonstrated in this notebook)

Please set your connection parameters in the following cell.

In [16]:
from franz.openrdf.connect import ag_connect
import requests

from llm_utils import create_nlq_vdb

REPO='kennedy'
NLQ_VDB='kennedy_vdb'

AGRAPH_USER=''
AGRAPH_PASSWORD=''
AGRAPH_HOST='demo2.franz.com'
AGRAPH_PORT='10079'

OPENAI_API_KEY=''
EMBEDDING_MODEL="text-embedding-ada-002"
EMBEDDER="openai"

# Connect to the main data repository
conn = ag_connect(
    REPO,
    clear=True,  # WARNING: removes any existing data in REPO before loading
    user=AGRAPH_USER,
    password=AGRAPH_PASSWORD,
    host=AGRAPH_HOST,
    port=AGRAPH_PORT
)
conn.addFile('../tutorial-files/kennedy.ntriples')

# Create (or connect to) the NLQ vector database for this repository
nlq_conn = create_nlq_vdb(
    REPO,
    conn,
    NLQ_VDB,
    host=AGRAPH_HOST,
    port=AGRAPH_PORT,
    user=AGRAPH_USER,
    password=AGRAPH_PASSWORD,
    openai_api_key=OPENAI_API_KEY,
    embedder=EMBEDDER,
    embedding_model=EMBEDDING_MODEL
)

## Executing Natural Language Queries

AllegroGraph's natural language query system translates human language into SPARQL queries through an iterative process that leverages previous successful queries and repository structure.

### Query Execution Syntax

```python
response = conn.execute_nl_query(
    prompt,  # the natural language question
    vdb_spec,  # the vector database to use; one data repository can have several
    with_fti=True,  # allow the system to use existing free-text indices for text search
    asterisk_in_select_clause=False  # whether the generated SPARQL may use * in its SELECT clause
)
```

### Response Components

The system returns a comprehensive response object containing:

* Generated Query (`response['query']`)
   * The translated SPARQL query
* Query Results (`response['result']`)
   * The data returned from executing the SPARQL query
* Reference Examples (`response['referenced-examples']`)
   * Similar successful query pairs from the vector database
   * Used as templates to guide translation
* Failed Attempts (`response['failed-attempts']`)
   * Track record of unsuccessful translation attempts
   * System makes up to 3 attempts before giving up
* SHACL Shapes (`response['shapes']`)
   * Repository structure definitions used to guide query generation
   * Ensures generated queries align with data model
   * May be absent: the example response shown below contains no `shapes` key
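
Because not every key is guaranteed to be present, it can help to read the response defensively. The following is a minimal sketch, not part of the AllegroGraph client: `summarize_nl_response` is a hypothetical helper, and it assumes that `failed-attempts` and `referenced-examples` are list-like when present.

```python
def summarize_nl_response(response):
    """Return a compact summary of an execute_nl_query response dict.

    Hypothetical helper: it only reads the keys described above and
    tolerates missing ones.
    """
    result = response.get("result") or {}
    return {
        "query": response.get("query"),
        "columns": result.get("names", []),
        "row_count": len(result.get("values", [])),
        # Assumption: these two entries are list-like when present.
        "failed_attempts": len(response.get("failed-attempts") or []),
        "referenced_examples": len(response.get("referenced-examples") or []),
        "has_shapes": "shapes" in response,
    }


# A hand-written stand-in for a real response, for illustration only.
sample = {
    "query": "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10",
    "result": {"names": ["s", "p", "o"], "values": [["<a>", "<b>", "<c>"]]},
    "failed-attempts": [],
    "referenced-examples": [],
}
summary = summarize_nl_response(sample)
print(summary["columns"], summary["row_count"], summary["has_shapes"])
# prints: ['s', 'p', 'o'] 1 False
```

With a live connection, the same helper could be applied to the value returned by `conn.execute_nl_query(prompt, NLQ_VDB)`.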

In [17]:
prompt = "show me 10 triples"

result = conn.execute_nl_query(
    prompt,
    NLQ_VDB
)

In [18]:
result.keys()

dict_keys(['query', 'result', 'failed-attempts', 'referenced-examples'])

## The SPARQL Query

In [19]:
sparql_query = result['query']
print(sparql_query)


PREFIX ns2: <http://www.franz.com/simple#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?s ?p ?o
WHERE {
  ?s ?p ?o
}
LIMIT 10



## The Results

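The `result` entry includes the column `names`, the returned rows under `values`, and a `sample-values` subset of those rows. Both `values` and `sample-values` are lists of rows, and each row is itself a list.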

In [20]:
result['result']

{'names': ['s', 'p', 'o'],
 'sample-values': [['<http://www.franz.com/simple#person1>',
   '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>',
   '<http://www.franz.com/simple#person>'],
  ['<http://www.franz.com/simple#person1>',
   '<http://www.franz.com/simple#first-name>',
   '"Joseph"'],
  ['<http://www.franz.com/simple#person1>',
   '<http://www.franz.com/simple#middle-initial>',
   '"Patrick"'],
  ['<http://www.franz.com/simple#person1>',
   '<http://www.franz.com/simple#last-name>',
   '"Kennedy"'],
  ['<http://www.franz.com/simple#person1>',
   '<http://www.franz.com/simple#suffix>',
   '"none"']],
 'values': [['<http://www.franz.com/simple#person1>',
   '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>',
   '<http://www.franz.com/simple#person>'],
  ['<http://www.franz.com/simple#person1>',
   '<http://www.franz.com/simple#first-name>',
   '"Joseph"'],
  ['<http://www.franz.com/simple#person1>',
   '<http://www.franz.com/simple#middle-initial>',
   '"Patrick"'],
  ['<http:/
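
Since rows arrive as positional lists, pairing them with `names` often makes downstream code easier to read. This sketch assumes only the `names` / `values` layout shown above; `rows_as_dicts` is a hypothetical helper name.

```python
def rows_as_dicts(result_table, key="values"):
    """Pair each row with the column names from the same result table."""
    names = result_table["names"]
    return [dict(zip(names, row)) for row in result_table.get(key, [])]


# One row copied from the output above.
table = {
    "names": ["s", "p", "o"],
    "values": [[
        "<http://www.franz.com/simple#person1>",
        "<http://www.franz.com/simple#first-name>",
        '"Joseph"',
    ]],
}
rows = rows_as_dicts(table)
print(rows[0]["o"])  # prints: "Joseph"
```

Passing `key="sample-values"` reads the sample rows instead of the full set.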

## Storing Natural Language and SPARQL Query Pairs

### Overview

To improve future query translations, you can store successful natural language to SPARQL query mappings in your vector database. This creates a growing knowledge base that enhances translation accuracy over time.

### Storage Method

```python
nlq_conn.store_nl_query_pair(
    prompt, # the original natural language question
    sparql_query # the SPARQL query
)
```

### Best Practices for Query Storage

* **Quality Control**
   * Verify query results before storage
   * Only store pairs that produce correct and intended results
   * Review the SPARQL query structure for optimal patterns
* **Connection Management**
   * Use the NLQ repository connection (`nlq_conn`), not your data repository connection
   * Ensure you're connected to the correct vector database instance
* **Quality Assurance**
   * Incorrect query pairs can be removed from the system
   * Regular review of stored pairs helps maintain system accuracy
   * See below for instructions on removing problematic pairs

### Warning
⚠️ Storing incorrect query pairs will degrade system performance. Always validate query results before storage.
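
One way to make the validation step harder to skip is a small guard around `store_nl_query_pair`. The sketch below is an assumption-laden illustration: `store_if_nonempty` is a hypothetical helper, and "returned at least one row" is only a rough proxy for correctness. A correct query can legitimately return no rows, and a query with rows can still be wrong, so manual review is still needed.

```python
def store_if_nonempty(nlq_conn, prompt, response):
    """Store the pair only if a query was generated and it returned rows.

    Returns True when the pair was stored, False otherwise.
    """
    query = response.get("query")
    rows = (response.get("result") or {}).get("values", [])
    if not query or not rows:
        return False
    nlq_conn.store_nl_query_pair(prompt, query)
    return True
```

In this notebook it would be called as `store_if_nonempty(nlq_conn, prompt, result)` after inspecting `result['result']`.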

In [21]:
nlq_conn.store_nl_query_pair(prompt, sparql_query)

## Managing Natural Language Query Pairs

### Retrieving Stored Query Pairs

You can access and search your database of successful query translations with the `get_nl_query_pairs` method, which supports both direct retrieval and similarity-based search. A main use case is looking up the `id` values of stored pairs that you want to delete.

### Method Syntax

```python
results = nlq_conn.get_nl_query_pairs(
    offset=0,                    # Pagination start point
    limit=100,                   # Maximum pairs to return
    neighbor_search="",          # Optional similarity search text
    neighbor_search_limit=10,    # Maximum similar pairs to return (only used with neighbor search)
    neighbor_search_min_score=.5 # Minimum similarity threshold (0-1) (only used with neighbor search)
)
```

### Search Modes

* **Direct Retrieval**
   * Lists all stored query pairs from the vector database
   * Useful for system audit and maintenance
   * Supports pagination for large collections
* **Similarity Search**
   * Finds semantically similar queries
   * Uses the same vector search mechanism as `execute_nl_query`
   * Helps understand system decision-making

### Return Values

Each result contains:

* `id`: Unique identifier for the pair (used for deletion)
* `nl-query`: Original natural language question
* `sparql-query`: Corresponding SPARQL translation
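
For collections larger than one page, `offset` and `limit` can be combined in a loop. The generator below is a sketch under two assumptions that the source does not confirm: that `offset` counts pairs (not pages) and that a page shorter than `limit` means the end has been reached. `iter_all_pairs` is a hypothetical name.

```python
def iter_all_pairs(nlq_conn, page_size=100):
    """Yield every stored pair, fetching one page at a time."""
    offset = 0
    while True:
        page = nlq_conn.get_nl_query_pairs(offset=offset, limit=page_size)
        if not page:
            break
        yield from page
        # Assumption: a short page means there is nothing left to fetch.
        if len(page) < page_size:
            break
        offset += len(page)
```

For example, `ids = [pair['id'] for pair in iter_all_pairs(nlq_conn)]` would collect the identifiers of all stored pairs.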


In [22]:
result = nlq_conn.get_nl_query_pairs()
result

[{'id': '_:bED2FE913x2',
  'nl-query': 'show me 10 triples',
  'sparql-query': '\nPREFIX ns2: <http://www.franz.com/simple#>\nPREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n\nSELECT ?s ?p ?o\nWHERE {\n  ?s ?p ?o\n}\nLIMIT 10\n'},
 {'id': '_:bED2FE913x6',
  'nl-query': 'Give me all persons, their first name, middle initial, last name, as well as their birth date.',
  'sparql-query': '\nPREFIX ns2: <http://www.franz.com/simple#>\nPREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n\nSELECT ?person ?firstName ?middleInitial ?lastName ?birthYear\nWHERE {\n  ?person rdf:type ns2:person ;\n          ns2:first-name ?firstName ;\n          ns2:middle-initial ?middleInitial ;\n          ns2:last-name ?lastName ;\n          ns2:birth-year ?birthYear .\n}\n'}]

## Query Performance Optimization Through Caching

### Intelligent Query Reuse

When you execute a previously stored natural language query, AllegroGraph's system automatically:

* Checks the vector store for exact matches
* Returns SPARQL translations immediately
* Bypasses the LLM translation process entirely

### Benefits

* Near-instantaneous response times for repeat queries
* Consistent SPARQL translations
* Reduced API costs by avoiding unnecessary LLM calls
* Guaranteed reproducibility of successful queries
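
Outside a notebook, where the `%%time` magic is unavailable, the same comparison can be made with the standard library. This is a minimal sketch; `timed_nl_query` is a hypothetical wrapper, and the measured time includes network latency, so it only gives a rough indication of whether the LLM step was skipped.

```python
import time


def timed_nl_query(conn, prompt, vdb_name):
    """Run execute_nl_query and return (response, elapsed_seconds)."""
    start = time.perf_counter()
    response = conn.execute_nl_query(prompt, vdb_name)
    elapsed = time.perf_counter() - start
    return response, elapsed
```

Calling `timed_nl_query(conn, "show me 10 triples", NLQ_VDB)` before and after storing the pair should show the difference in wall time.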

In [23]:
%%time
prompt = "show me 10 triples"

result = conn.execute_nl_query(
    prompt,
    NLQ_VDB
)
print(result['query'])


PREFIX ns2: <http://www.franz.com/simple#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?s ?p ?o
WHERE {
  ?s ?p ?o
}
LIMIT 10

CPU times: user 4.04 ms, sys: 0 ns, total: 4.04 ms
Wall time: 29.1 ms


## More Examples

We will now run a more difficult query.
**Note:** the generated query may not be correct on the first try.

In [24]:
prompt = "Give me all persons, their first name, middle initial, last name, as well as their birth date."
result = conn.execute_nl_query(prompt, NLQ_VDB)

In [25]:
sparql_query = result['query']
print(sparql_query)


PREFIX ns2: <http://www.franz.com/simple#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?person ?firstName ?middleInitial ?lastName ?birthYear
WHERE {
  ?person rdf:type ns2:person ;
          ns2:first-name ?firstName ;
          ns2:middle-initial ?middleInitial ;
          ns2:last-name ?lastName ;
          ns2:birth-year ?birthYear .
}



In [26]:
print(result['result'])

{'names': ['person', 'firstName', 'middleInitial', 'lastName', 'birthYear'], 'sample-values': [['<http://www.franz.com/simple#person76>', '"Patrick"', '"Joseph"', '"Kennedy"', '"1967"'], ['<http://www.franz.com/simple#person75>', '"Katherine"', '"Anne"', '"Gershman"', '"1959"'], ['<http://www.franz.com/simple#person74>', '"Edward"', '"M"', '"Kennedy"', '"1961"'], ['<http://www.franz.com/simple#person73>', '"Michael"', '"nil"', '"Allen"', '"1958"'], ['<http://www.franz.com/simple#person72>', '"Kara"', '"Anne"', '"Kennedy"', '"1960"']], 'values': [['<http://www.franz.com/simple#person76>', '"Patrick"', '"Joseph"', '"Kennedy"', '"1967"'], ['<http://www.franz.com/simple#person75>', '"Katherine"', '"Anne"', '"Gershman"', '"1959"'], ['<http://www.franz.com/simple#person74>', '"Edward"', '"M"', '"Kennedy"', '"1961"'], ['<http://www.franz.com/simple#person73>', '"Michael"', '"nil"', '"Allen"', '"1958"'], ['<http://www.franz.com/simple#person72>', '"Kara"', '"Anne"', '"Kennedy"', '"1960"'], ['<

If this response is correct, we can store the pair as before. If it is not, run the next cell first to set the correct SPARQL query manually, then store it.

In [27]:
#OPTIONAL CELL - RUN IF THE ABOVE RESULT IS INCORRECT
sparql_query = """
    PREFIX ns2: <http://www.franz.com/simple#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    
    SELECT ?person ?firstName ?middleInitial ?lastName ?birthYear
    WHERE {
      ?person rdf:type ns2:person ;
              ns2:first-name ?firstName ;
              ns2:middle-initial ?middleInitial ;
              ns2:last-name ?lastName ;
              ns2:birth-year ?birthYear .
    }
"""

In [28]:
nlq_conn.store_nl_query_pair(prompt, sparql_query)

Now we will run a very similar query, and we will show in the output how the previous query was used as a template for the new one.

In [29]:
prompt = "Give me all persons, their first name, middle initial, last name, as well as their birth date and their optional death date"
result = conn.execute_nl_query(prompt, NLQ_VDB)

In [30]:
sparql_query = result['query']
print(sparql_query)


PREFIX ns2: <http://www.franz.com/simple#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?person ?firstName ?middleInitial ?lastName ?birthYear ?deathYear
WHERE {
  ?person rdf:type ns2:person ;
          ns2:first-name ?firstName ;
          ns2:middle-initial ?middleInitial ;
          ns2:last-name ?lastName ;
          ns2:birth-year ?birthYear .
  OPTIONAL {
    ?person ns2:death-year ?deathYear .
  }
}



This query closely follows the previous one. The `referenced-examples` field confirms that the stored pair was used as a template:

In [31]:
result['referenced-examples']

[{'nl-query': 'Give me all persons, their first name, middle initial, last name, as well as their birth date.',
  'sparql-query': '\n    PREFIX ns2: <http://www.franz.com/simple#>\n    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n    \n    SELECT ?person ?firstName ?middleInitial ?lastName ?birthYear\n    WHERE {\n      ?person rdf:type ns2:person ;\n              ns2:first-name ?firstName ;\n              ns2:middle-initial ?middleInitial ;\n              ns2:last-name ?lastName ;\n              ns2:birth-year ?birthYear .\n    }\n'},
 {'nl-query': 'Give me all persons, their first name, middle initial, last name, as well as their birth date.',
  'sparql-query': '\nPREFIX ns2: <http://www.franz.com/simple#>\nPREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n\nSELECT ?person ?firstName ?middleInitial ?lastName ?birthYear\nWHERE {\n  ?person rdf:type ns2:person ;\n          ns2:first-name ?firstName ;\n          ns2:middle-initial ?middleInitial ;\n          ns2:l
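
The full entries are long, so when checking which stored questions influenced a translation it can be enough to list the questions alone. The sketch below relies only on the `nl-query` key shown in the output above; `referenced_questions` is a hypothetical helper.

```python
def referenced_questions(response):
    """Return the distinct natural language questions that were referenced,
    in the order they appear in the response."""
    seen = []
    for example in response.get("referenced-examples") or []:
        question = example.get("nl-query")
        if question is not None and question not in seen:
            seen.append(question)
    return seen
```

If the same question appears more than once, as in the output above, it has been stored more than once in the vector database.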

## Managing Vector Store Content: Removing Query Pairs

### Overview
AllegroGraph provides tools to curate your query pair database, allowing you to maintain quality and relevance by removing outdated or incorrect entries.

### Deletion Process

#### Step 1: Identify Pairs to Remove

You can locate specific query pairs using either:

* Direct ID lookup from all known pairs
* Similarity search to find related queries

In [32]:
result = nlq_conn.get_nl_query_pairs(
    neighbor_search="give me all persons and their names",
    neighbor_search_min_score=.75
)

for response in result:
    print(response)
    print()

{'id': '_:bED2FE913x8', 'nl-query': 'Give me all persons, their first name, middle initial, last name, as well as their birth date.', 'sparql-query': '\n    PREFIX ns2: <http://www.franz.com/simple#>\n    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n    \n    SELECT ?person ?firstName ?middleInitial ?lastName ?birthYear\n    WHERE {\n      ?person rdf:type ns2:person ;\n              ns2:first-name ?firstName ;\n              ns2:middle-initial ?middleInitial ;\n              ns2:last-name ?lastName ;\n              ns2:birth-year ?birthYear .\n    }\n'}

{'id': '_:bED2FE913x6', 'nl-query': 'Give me all persons, their first name, middle initial, last name, as well as their birth date.', 'sparql-query': '\nPREFIX ns2: <http://www.franz.com/simple#>\nPREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n\nSELECT ?person ?firstName ?middleInitial ?lastName ?birthYear\nWHERE {\n  ?person rdf:type ns2:person ;\n          ns2:first-name ?firstName ;\n          ns2:middle-i

#### Step 2: Remove Selected Pairs

Use the `delete_nl_query_pairs` method to remove entries:

```python
nlq_conn.delete_nl_query_pairs(
    ['id1', 'id2'] # a list of IDs to delete
)
```


In [33]:
ids = [result[0]['id']]
nlq_conn.delete_nl_query_pairs(ids)

Running the same similarity search again shows that the deleted pair is no longer returned:

In [34]:
result = nlq_conn.get_nl_query_pairs(
    neighbor_search="give me all persons and their names",
    neighbor_search_min_score=.75
)

for response in result:
    print(response)
    print()

{'id': '_:bED2FE913x2', 'nl-query': 'show me 10 triples', 'sparql-query': '\nPREFIX ns2: <http://www.franz.com/simple#>\nPREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n\nSELECT ?s ?p ?o\nWHERE {\n  ?s ?p ?o\n}\nLIMIT 10\n'}



### Best Practices for Deletion

#### Verification

* Always verify IDs before deletion
* Consider keeping a backup of deleted pairs
* Review the impact on similar queries

#### Maintenance Strategy

* Regularly review and clean up outdated pairs
* Remove incorrect or poorly performing translations
* Consider periodic quality audits
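
Part of a periodic review can be automated. For instance, the same question can end up stored several times with different SPARQL text. The following sketch groups stored pairs by question so duplicates stand out; `duplicate_question_ids` is a hypothetical helper, and treating questions as equal after trimming and lowercasing is an assumption of this example, not the system's own matching rule.

```python
from collections import defaultdict


def duplicate_question_ids(pairs):
    """Map each repeated question to the ids of the pairs that share it."""
    by_question = defaultdict(list)
    for pair in pairs:
        # Assumption: compare questions case-insensitively, ignoring outer spaces.
        key = pair["nl-query"].strip().lower()
        by_question[key].append(pair["id"])
    return {q: ids for q, ids in by_question.items() if len(ids) > 1}


pairs = [
    {"id": "_:x2", "nl-query": "show me 10 triples", "sparql-query": "..."},
    {"id": "_:x6", "nl-query": "Give me all persons", "sparql-query": "..."},
    {"id": "_:x8", "nl-query": "give me all persons ", "sparql-query": "..."},
]
print(duplicate_question_ids(pairs))
# prints: {'give me all persons': ['_:x6', '_:x8']}
```

The input would normally come from `nlq_conn.get_nl_query_pairs()`; the ids in the example are placeholders.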

#### System Impact

* Deletions are permanent and cannot be undone
* Removing pairs may affect translation quality for similar queries
* Consider adding replacement pairs for critical query types
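
Because deletions cannot be undone, one option is to write the affected pairs to a file first. The sketch below is one possible approach, not a built-in feature: `backup_and_delete`, the JSON file format, and the `limit` default of 1000 are all choices made for this example. Ids that are not found among the retrieved pairs are neither backed up nor deleted.

```python
import json


def backup_and_delete(nlq_conn, ids, backup_path, limit=1000):
    """Save the pairs with the given ids to a JSON file, then delete them.

    Returns the list of pairs that were backed up and deleted.
    """
    wanted = set(ids)
    stored = nlq_conn.get_nl_query_pairs(limit=limit)
    pairs = [pair for pair in stored if pair["id"] in wanted]
    # Write the backup before deleting anything.
    with open(backup_path, "w", encoding="utf-8") as fh:
        json.dump(pairs, fh, indent=2)
    if pairs:
        nlq_conn.delete_nl_query_pairs([pair["id"] for pair in pairs])
    return pairs
```

A backed-up pair can later be restored with `nlq_conn.store_nl_query_pair(pair['nl-query'], pair['sparql-query'])`, though it may receive a new `id`.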

## AllegroGraph Natural Language Query System: Summary and Best Practices

### Core Methods Summary

* Execute Query (`execute_nl_query`)
   * Translates natural language to SPARQL
   * Returns query, results, and translation metadata
* Store Query Pairs (`store_nl_query_pair`)
   * Saves successful translations for future use
   * Improves system performance over time
* Retrieve Query Pairs (`get_nl_query_pairs`)
   * Access stored query translations
   * Search for similar queries
* Delete Query Pairs (`delete_nl_query_pairs`)
   * Remove incorrect or outdated translations
   * Maintain system quality

### Recommended Implementation Strategy

#### Initial System Training

* Develop Training Dataset
   * Create comprehensive query set covering common use cases
   * Include variations of similar questions
   * Cover edge cases and complex queries
   * Test and validate SPARQL translations
* Batch Training
   * Store validated query pairs
   * Review system performance
   * Iterate and refine training set
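
The batch step can be sketched as a loop over already validated pairs. This is an illustration under assumptions: `store_validated_pairs` is a hypothetical helper, the input is a list of `(question, sparql)` tuples that a person has already checked, and questions are skipped only on an exact text match with what `get_nl_query_pairs` returns (by default a single page of results).

```python
def store_validated_pairs(nlq_conn, validated_pairs):
    """Store each (question, sparql) pair whose question is not stored yet.

    Returns the number of pairs that were stored.
    """
    existing = {pair["nl-query"] for pair in nlq_conn.get_nl_query_pairs()}
    stored = 0
    for question, sparql in validated_pairs:
        if question in existing:
            continue  # avoid storing the same question twice
        nlq_conn.store_nl_query_pair(question, sparql)
        existing.add(question)
        stored += 1
    return stored
```

For example, `store_validated_pairs(nlq_conn, [("show me 10 triples", "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10")])` would store the pair only if that question is not already present.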

#### Continuous Improvement

* Monitor user queries and success rates
* Regular review of stored pairs
* Periodic system retraining
* Expert validation of edge cases

### Important Implementation Notes

#### Result Validation and System Enhancement

* User Responsibility
   * **Result Verification**: Users must verify that SPARQL results match their query intent
   * **Empty Results**: A correct SPARQL query may return no results if the requested data doesn't exist in the repository
   * **False Positives**: Syntactically correct queries may not capture the user's intended semantic meaning

#### Best Practices

* Start with a strong foundation of validated queries
* Implement user feedback mechanisms
* Regular quality audits
* Monitor system performance metrics
* Maintain documentation of known good queries