# AllegroGraph LLM Embedding Examples

In [1]:
from franz.openrdf.connect import ag_connect
from franz.openrdf.vocabulary import RDF, RDFS
from llm_utils import BufferTriples, addArbitraryTextString, read_text, FindNearestNeighbors, AskMyDocuments
from franz.openrdf.model.value import URI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema.document import Document
import shortuuid
import pandas as pd
import urllib
import textwrap
import os

# Set your environment variables here
os.environ['AGRAPH_USER'] = 'your username'
os.environ['AGRAPH_PASSWORD'] = 'your password'
os.environ['AGRAPH_HOST'] = 'your hostname'  # On an AllegroGraph Cloud server the host looks something like https://ag1zzkvywf0yteww.allegrograph.cloud
os.environ['AGRAPH_PORT'] = 'your port'  # On an AllegroGraph Cloud server the port should be '443'
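
Before connecting, it can help to confirm that none of the four variables is still a template value. The following is a minimal sketch; the helper name `missing_agraph_settings` is made up for this illustration and is not part of `llm_utils.py` or the AllegroGraph client.

```python
import os

# The four settings stored in the environment by the cell above.
REQUIRED_SETTINGS = ("AGRAPH_USER", "AGRAPH_PASSWORD", "AGRAPH_HOST", "AGRAPH_PORT")


def missing_agraph_settings(env=os.environ):
    """Return the names of settings that are unset or still placeholders.

    Hypothetical helper for this notebook only.
    """
    missing = []
    for name in REQUIRED_SETTINGS:
        value = env.get(name, "").strip()
        # The template values in the cell above all start with "your ".
        if not value or value.startswith("your "):
            missing.append(name)
    return missing


problems = missing_agraph_settings()
if problems:
    print("Please set:", ", ".join(problems))
```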

Before starting any other work, it is very important to set your OpenAI API key in your AllegroGraph server. The directions are in the README, but they are repeated here because of their importance.

1. Please navigate to your local installation of the new webview
2. Go to the repository where your data is stored (`llm-philosophy` if you're using the repo created from this demo)
3. Go to `Repository Control` in the left column under `Repository` and search for `query execution options`. Select it.
4. Select `+ New Query Option` and add **openaiApiKey** as the _name_, and your OpenAI API key as the _value_. Set the `Scope` to **Repository** (you will have to do this for both examples)
5. Don't forget to save by hitting `Save Query Options`!

# Import Philosophy Books into AllegroGraph

In this example we read in a few philosophy books from [Project Gutenberg](https://www.gutenberg.org/) and then show the power of two new AllegroGraph Magic Predicates, `llm:nearestNeighbor` and `llm:askMyDocuments`. First we select a few books from the Gutenberg Library, gather some data from the website manually, and add it to the following dictionary. Each key of the dictionary is the URI we want the book to have in the graph. The author and title speak for themselves, and the contents are a link to a text version of the book ([example here](https://www.gutenberg.org/cache/epub/7370/pg7370.txt)):

In [3]:
documents = {
    "http://franz.com/llm/SecondTreatiseOfGovernment":{
        "author": "John Locke",
        "title": "Second Treatise of Government",
        "contents": "https://www.gutenberg.org/cache/epub/7370/pg7370.txt",
    },
    "http://franz.com/llm/TheRepublic":{
        "author": "Plato",
        "title": "The Republic",
        "contents": "https://www.gutenberg.org/cache/epub/1497/pg1497.txt",
    },
    "http://franz.com/llm/CritiqueOfPureReason":{
        "author": "Immanuel Kant",
        "title": "The Critique of Pure Reason",
        "contents": "https://www.gutenberg.org/cache/epub/4280/pg4280.txt"
    },
    "http://franz.com/llm/TreatiseOfHumanNature":{
        "author": "David Hume",
        "title": "A Treatise of Human Nature",
        "contents": "https://www.gutenberg.org/cache/epub/4705/pg4705.txt"
    }
}

Then we connect to a new AllegroGraph repository and declare some local namespaces we will use to add the documents to the graph as triples. Please add the necessary parameters to connect to your AllegroGraph server.

In [2]:
conn = ag_connect('llm-philosophy')
conn.setNamespace('', 'http://franz.com/llm/')
f = conn.namespace('http://franz.com/llm/')

We then loop through the documents and grab the contents using the `urllib` library. For each document we split using [Langchain's Recursive Text Splitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter) and add the chunks of text to the graph. You can examine the splitting of the text in the _addArbitraryTextString_ code in the `llm_utils.py` file in this same directory.

In [5]:
buffer = BufferTriples(conn)
for document in documents:
    id_uri = conn.createURI(document)
    buffer.add((id_uri, RDF.TYPE, f.Document, id_uri))
    buffer.add((id_uri, f.title, documents[document]['title'], id_uri))
    buffer.add((id_uri, RDFS.LABEL, documents[document]['title'], id_uri))
    text = read_text(documents[document]['contents'])
    buffer = addArbitraryTextString(conn, f, buffer, text, id_uri)
buffer.flush_triples()
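
To give a feel for what recursive splitting does, here is a simplified stand-in written in plain Python. It is not the code in `llm_utils.py` and not LangChain's implementation; the function name, the default `chunk_size`, and the separator list are assumptions chosen for illustration.

```python
def split_into_chunks(text, chunk_size=1000, separators=("\n\n", "\n", " ")):
    """Greedy recursive splitter (illustrative sketch only).

    Tries the coarsest separator first and falls back to finer ones
    for pieces that are still longer than chunk_size.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: cut at a fixed width.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current.strip():
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: split it with finer separators.
            chunks.extend(split_into_chunks(piece, chunk_size, finer))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph, a little longer.\n\n" + "word " * 50
for chunk in split_into_chunks(sample, chunk_size=40):
    print(repr(chunk))
```

Each resulting chunk would then become one text node attached to its book, which is why the picture of the graph depends on the splitter and its settings.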

Note that the following image may not be the same for you depending on the splitter used.

![philosophy-books](images/philosophy-books.png)

## Indexing the text fragments with OpenAI

In order to ask questions of your own documents, you first need to index all the text fragments using OpenAI embeddings. The documentation on how to do this and how it works can be found [here](https://internal.franz.com/people/dm/docs/llmembed.html). There are currently two ways to do this embedding: via the AllegroGraph Webview, or via our command line tool `agtool`.

### Embedding via AllegroGraph Webview

To embed via WebView navigate to `Repository Control` in the left column under `Repository`. Then select `Create LLM Embedding`. On this page you can:
* Create or open a vector store
* Choose your embedder and model, and add your API key
* Decide which predicates/types to include or exclude from the embedding. In this example we will embed all text that is the object of a triple with predicate `<http://franz.com/llm/text>`

![webview-embedding](images/webview-llm-embed-philosophy.png)

### Embedding via agtool

For agtool to work it needs some metadata that we put in a `.def` file as described in the link above. In this tutorial we define `philosophy.def` and assume we are using `OpenAI` for embedding. Note that this method is currently not available if you are on an AllegroGraph Cloud server.
```text
gpt
 api-key "your-openai-api-key-here"
 vector-database-name 10035/philosophy-vec
 embedder openai
 include-predicates <http://franz.com/llm/text>
```
The explanations of these parameters can be found in the documentation linked above.
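
If you prefer not to paste your API key into a file by hand, the `.def` text can be generated from Python. This is only a sketch that reproduces the layout shown above: `render_def` is a hypothetical helper, and `OPENAI_API_KEY` is an assumed environment variable name that `agtool` itself does not read.

```python
import os


def render_def(vector_db, predicate, api_key, embedder="openai"):
    """Build the text of a .def file in the layout shown above (hypothetical helper)."""
    lines = [
        "gpt",
        f' api-key "{api_key}"',
        f" vector-database-name {vector_db}",
        f" embedder {embedder}",
        f" include-predicates <{predicate}>",
    ]
    return "\n".join(lines) + "\n"


definition = render_def(
    vector_db="10035/philosophy-vec",
    predicate="http://franz.com/llm/text",
    # Assumed variable name; falls back to the placeholder used in this tutorial.
    api_key=os.environ.get("OPENAI_API_KEY", "your-openai-api-key-here"),
)
print(definition)
```

Writing the returned string to `philosophy.def` gives the file used in the next step.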

We run the following command to index the documents (adjust it to match the location of your own server and files). This command assumes you created an alias for agtool, and that your server runs on localhost:10035.

```shell
agtool llm index localhost:10035/llm-philosophy philosophy.def 
```

Once the embedding is done we can start querying the graph!

## Nearest Neighbor SPARQL Query

The general syntax for the query clause of `llm:nearestNeighbor` is as follows:
```
(?uri ?score ?originalText) llm:nearestNeighbor (?text ?vector-database ?topN ?minScore)
```
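
When the search phrase comes from user input, it is convenient to build the query string with a small helper. The sketch below only formats text; `nearest_neighbor_query` is a hypothetical name, and the defaults mirror the values used in the sample query in this notebook. It assumes a single-line phrase.

```python
def nearest_neighbor_query(text, vector_db, top_n=10, min_score=0.1):
    """Return a SPARQL query string for llm:nearestNeighbor (illustrative helper)."""

    def literal(value):
        # Escape backslashes and double quotes for a SPARQL string literal.
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

    return (
        "PREFIX llm: <http://franz.com/ns/allegrograph/8.0.0/llm/>\n"
        "SELECT ?uri ?score ?originalText WHERE {\n"
        "  (?uri ?score ?originalText) llm:nearestNeighbor "
        f"({literal(text)} {literal(vector_db)} {int(top_n)} {float(min_score)})\n"
        "}"
    )


print(nearest_neighbor_query("government", "philosophy-vec"))
```

The returned string can be passed to `conn.executeTupleQuery` in the same way as the hand-written query.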

Please make sure you have set your OpenAI API key in Webview before attempting the following queries.

We will start with a sample SPARQL query for nearest neighbors.

In [3]:
query_string = """
        PREFIX llm: <http://franz.com/ns/allegrograph/8.0.0/llm/> 
        select * where { 
            (?uri ?score ?originalText) llm:nearestNeighbor ("government" "philosophy-vec" 10 0.1)  }"""
with conn.executeTupleQuery(query_string) as result:
    df = result.toPandas()
df.head()

|   | originalText | score | uri |
|---|---|---|---|
| 0 | private education of parents, contribute to th... | 0.813789 | `<http://franz.com/vdb/id/227>` |
| 1 | government, and do either promote, or not, wha... | 0.813346 | `<http://franz.com/vdb/id/4150>` |
| 2 | to a kind of natural authority, that the chief... | 0.811307 | `<http://franz.com/vdb/id/4288>` |
| 3 | And the government is the ruling power in each... | 0.809082 | `<http://franz.com/vdb/id/3415>` |
| 4 | in the execution is connected, though not imme... | 0.808754 | `<http://franz.com/vdb/id/238>` |


### Wrapping nearestNeighbor in a function
We wrote a sample class that allows users to find nearest neighbors and also perform some additional tasks with the response object. The parameters are:
- `conn` - the connection object
- `phrase` - the phrase for which you are looking to find the nearest neighbors
- `vector_db` - the vector database
- `number` - (optional) set to 10 if not declared, sets the maximum number of neighbors you wish returned
- `confidence` - (optional) set to .5 if not declared, sets the minimum matching score for all returned vectors
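
To make the roles of `number` and `confidence` concrete, the following sketch applies the same two limits to a plain Python list. The real filtering happens inside the `llm:nearestNeighbor` magic predicate on the server; `select_neighbors` is a hypothetical function, and treating the minimum score as inclusive is an assumption. The sample scores are taken from the query output above.

```python
def select_neighbors(candidates, number=10, confidence=0.5):
    """Keep candidates scoring at least `confidence`, best first, at most `number`.

    `candidates` is a list of (uri, score) pairs. Illustrative only.
    """
    kept = [pair for pair in candidates if pair[1] >= confidence]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:number]


sample = [
    ("<http://franz.com/vdb/id/4288>", 0.811307),
    ("<http://franz.com/vdb/id/227>", 0.813789),
    ("<http://franz.com/vdb/id/4150>", 0.813346),
]
for uri, score in select_neighbors(sample, number=2, confidence=0.812):
    print(uri, score)
```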


In [4]:
nn = FindNearestNeighbors(conn, 'government', 'philosophy-vec')

private education of parents, contribute to the giving us a sense of  honour and duty in the strict
regulation of our actions with regard to  the properties of others.          SECT. VII OF THE ORIGIN
OF GOVERNMENT      Nothing is more certain, than that men are, in a great measure, governed  by
interest, and that even when they extend their concern beyond  themselves, it is not to any great
distance; nor is it usual for  them, in common life, to look farther than their nearest friends and
acquaintance. It is no less certain, that it is impossible for men to  consult, their interest in so
effectual a manner, as by an universal and  inflexible observance of the rules of justice, by which
alone they can  preserve society, and keep themselves from falling into that wretched  and savage
condition, which is commonly represented as the state of  nature. And as this interest, which all
men have in the upholding of  society, and the observation of the rules of justice, is great, so is


The following method simply prints the URI of the nearest vector, the matching score, and the text of the matching vector. 

In [5]:
nn.proof()

0 <http://franz.com/vdb/id/227> 0.8137894
private education of parents, contribute to the giving us a sense of  honour and duty in the strict
regulation of our actions with regard to  the properties of others.          SECT. VII OF THE ORIGIN
OF GOVERNMENT      Nothing is more certain, than that men are, in a great measure, governed  by
interest, and that even when they extend their concern beyond  themselves, it is not to any great
distance; nor is it usual for  them, in common life, to look farther than their nearest friends and
acquaintance. It is no less certain, that it is impossible for men to  consult, their interest in so
effectual a manner, as by an universal and  inflexible observance of the rules of justice, by which
alone they can  preserve society, and keep themselves from falling into that wretched  and savage
condition, which is commonly represented as the state of  nature. And as this interest, which all
men have in the upholding of  society, and the observation of the 

The `add_neighbors_to_graph` method creates a connection between each of the "neighbors". It does this with the following steps:
- It creates a node for the request, which stores the phrase, the minimum matching score, and when the query was run.
- It creates a blank node for each neighbor, which stores the matching score and the index.
- It connects each blank node to its respective text chunk.

In [6]:
nn.add_neighbors_to_graph()
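
The shape of the data added by this method can be sketched with plain tuples. Everything below is an assumption made for illustration: the predicate names (`phrase`, `minScore`, `ranAt`, `hasNeighbor`, `score`, `index`, `chunk`) and the request identifier are invented, and the actual names used by `llm_utils.py` may differ.

```python
from datetime import datetime, timezone

NS = "http://franz.com/llm/"  # namespace used throughout this notebook


def neighbor_triples(request_id, phrase, min_score, neighbors):
    """Return (subject, predicate, object) tuples linking a request to its neighbors.

    All predicate names here are hypothetical.
    """
    request = f"<{NS}{request_id}>"
    triples = [
        (request, f"<{NS}phrase>", phrase),
        (request, f"<{NS}minScore>", min_score),
        (request, f"<{NS}ranAt>", datetime.now(timezone.utc).isoformat()),
    ]
    for index, (chunk_uri, score) in enumerate(neighbors):
        link = f"_:neighbor{index}"  # blank node holding the score and the rank
        triples += [
            (request, f"<{NS}hasNeighbor>", link),
            (link, f"<{NS}score>", score),
            (link, f"<{NS}index>", index),
            (link, f"<{NS}chunk>", chunk_uri),
        ]
    return triples


for triple in neighbor_triples(
    "request-1", "government", 0.5, [("<http://franz.com/vdb/id/227>", 0.8137894)]
):
    print(triple)
```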

The following is an image generated from an old example and will look slightly different

![nearest neighbors](images/nearestneighbors.png)

In [7]:
nn = FindNearestNeighbors(conn, 'What is Human Nature', 'philosophy-vec', number=10, confidence=.2)

which will bear the examination of the latest posterity. For my part,  my only hope is, that I may
contribute a little to the advancement  of knowledge, by giving in some particulars a different turn
to the  speculations of philosophers, and pointing out to them more distinctly  those subjects,
where alone they can expect assurance and conviction.  Human Nature is the only science of man; and
yet has been hitherto the  most neglected. It will be sufficient for me, if I can bring it a little
more into fashion; and the hope of this serves to compose my temper  from that spleen, and
invigorate it from that indolence, which  sometimes prevail upon me. If the reader finds himself in
the same easy  disposition, let him follow me in my future speculations. If not, let  him follow his
inclination, and wait the returns of application and good  humour. The conduct of a man, who studies
philosophy in this careless  manner, is more truly sceptical than that of one, who feeling in
himself


In [8]:
nn.proof()

0 <http://franz.com/vdb/id/759> 0.8552525
which will bear the examination of the latest posterity. For my part,  my only hope is, that I may
contribute a little to the advancement  of knowledge, by giving in some particulars a different turn
to the  speculations of philosophers, and pointing out to them more distinctly  those subjects,
where alone they can expect assurance and conviction.  Human Nature is the only science of man; and
yet has been hitherto the  most neglected. It will be sufficient for me, if I can bring it a little
more into fashion; and the hope of this serves to compose my temper  from that spleen, and
invigorate it from that indolence, which  sometimes prevail upon me. If the reader finds himself in
the same easy  disposition, let him follow me in my future speculations. If not, let  him follow his
inclination, and wait the returns of application and good  humour. The conduct of a man, who studies
philosophy in this careless  manner, is more truly sceptical than tha

In [9]:
nn = FindNearestNeighbors(conn, 'Philosopher King', 'philosophy-vec', confidence=.8)

his error. In the state of which he would be the founder, there is no  marrying or giving in
marriage: but because of the infirmity of  mankind, he condescends to allow the law of nature to
prevail.    (c) But Plato has an equal, or, in his own estimation, even greater  paradox in reserve,
which is summed up in the famous text, ‘Until kings  are philosophers or philosophers are kings,
cities will never cease  from ill.’ And by philosophers he explains himself to mean those who  are
capable of apprehending ideas, especially the idea of good. To the  attainment of this higher
knowledge the second education is directed.  Through a process of training which has already made
them good citizens  they are now to be made good legislators. We find with some surprise  (not
unlike the feeling which Aristotle in a well-known passage  describes the hearers of Plato’s
lectures as experiencing, when they  went to a discourse on the idea of good, expecting to be
instructed in


### askMyDocuments SPARQL Query

This magic predicate makes ChatGPT read the topN nearest neighbors found by `llm:nearestNeighbor` and then give an answer using only the output of that function. The syntax of this magic predicate follows here, see also documentation <here>:
```
(?response ?score ?citation ?content) llm:askMyDocuments (?query ?vectorDatabase ?topN ?minScore)
```

In [15]:
query_string = """
    PREFIX llm: <http://franz.com/ns/allegrograph/8.0.0/llm/>
    select ?response ?score ?citation ?content where {
        (?response ?score ?citation ?content) llm:askMyDocuments ("cite two opposing views on government" "philosophy-vec" 10 .5) }"""
with conn.executeTupleQuery(query_string) as result:
    df = result.toPandas()
df.head()

|   | response | score | citation | content |
|---|---|---|---|---|
| 0 | One view on government advocates for resistanc... | 0.82441 | `<http://franz.com/llm/TreatiseOfHumanNature_1252>` | necessity of self-preservation, and the same m... |
| 1 | One view on government advocates for resistanc... | 0.82202 | `<http://franz.com/llm/TheRepublic_1125>` | When I take a country walk, he said, I often e... |


We have created another class as an example that shows some possible functionality. Again, the code for this can be found in `llm_utils.py`. Creating an `AskMyDocuments` instance always prints the response, for ease of use in this notebook. The arguments are as follows:
- `conn` - the connection object
- `question` - the question to your documents
- `vector_db` - the vector database where indexed text is stored
- `number` - the maximum number of responses
- `confidence` - the minimum matching score

In [16]:
response = AskMyDocuments(conn, 'cite two opposing views on government', 'philosophy-vec', number=20, confidence=0)

One view on government advocates for resistance against tyranny and oppression, emphasizing the
right to defend against encroachment on constitutional bounds. Another view highlights the dangers
of democracy leading to anarchy and the importance of a strong, stable government.


The `df` attribute simply gives the user access to the complete response of the SPARQL query.

In [17]:
response.df

|   | citation | content | response | score |
|---|---|---|---|---|
| 0 | `<http://franz.com/llm/TreatiseOfHumanNature_1252>` | necessity of self-preservation, and the same m... | One view on government advocates for resistanc... | 0.82441 |
| 1 | `<http://franz.com/llm/TheRepublic_301>` | citizens. Now there are occasions on which the... | One view on government advocates for resistanc... | 0.816725 |
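
Because the same response text is repeated on every row, it is often handy to collapse the result into one answer plus a ranked list of citations. The sketch below works on a list of dictionaries such as the one returned by `response.df.to_dict("records")`; `summarize_answer` is a hypothetical helper and not part of `llm_utils.py`.

```python
def summarize_answer(rows):
    """Collapse result rows into one answer plus its citations, best score first."""
    if not rows:
        return None, []
    answer = rows[0]["response"]  # the same text is repeated on every row
    citations = sorted(
        ((row["citation"], row["score"]) for row in rows),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return answer, citations


rows = [
    {"citation": "<http://franz.com/llm/TheRepublic_301>", "score": 0.816725,
     "response": "One view on government advocates for resistanc..."},
    {"citation": "<http://franz.com/llm/TreatiseOfHumanNature_1252>", "score": 0.82441,
     "response": "One view on government advocates for resistanc..."},
]
answer, citations = summarize_answer(rows)
print(answer)
for uri, score in citations:
    print(score, uri)
```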


As in the `llm:nearestNeighbor` example, we also created a method that links all evidence chunks of the response to a newly created response object that stores the metadata of the `AskMyDocuments` instance.

In [18]:
response.add_evidence_to_graph()

![ask my documents](images/ask_my_documents.png)

In [20]:
response = AskMyDocuments(conn, 'What is the purpose of humanity', 'philosophy-vec')

The purpose of humanity is to strive for moral excellence and to contribute to the progress of
civilization through intelligence and enlightenment.


In [21]:
response = AskMyDocuments(conn, "What state are humans naturally in", 'philosophy-vec')

Humans are naturally in a state of perfect freedom, equality, and reciprocity, governed by the law
of nature and reason, until they choose to join a political society.


# Asking Questions of a Contract

In this example we will ask questions of a contract and show the section(s) of the contract where the answers can be found. [Here](https://sccrtc.org/wp-content/uploads/2010/09/SampleContract-Shuttle.pdf) is a link to the contract in question.

We start by creating a new repository and adding the previously parsed contract triples (please add your parameters above to connect to the repository)

In [22]:
conn = ag_connect('contracts')
conn.addFile('contract.nt')
conn.size()

342

![contract gruff image](images/contract-compensation.png)

Now we have to index the text again. Remember that there are two ways to do this: via Webview or via `agtool`.

### Embedding via Webview

To embed via WebView navigate to `Repository Control` in the left column under `Repository`. Then select `Create LLM Embedding`. On this page you can:
* Create or open a vector store
* Choose your embedder and model, and add your API key
* Decide which predicates/types to include or exclude from the embedding. In this example we will embed all text that is the object of a triple with predicate `<http://franz.com/llm/text>`

![webview-embedding](images/webview-llm-embed-contract.png)

### Embedding via agtool

Remember that this method will not work if you are on an AllegroGraph cloud server.

We define `contract.def` as follows:
```text
gpt
 api-key "your-openai-api-key-here"
 vector-database-name localhost:10035/contract-vec
 embedder openai
 limit 1000000
 include-predicates <http://franz.com/hasContent>
```

And then again we run the `agtool llm index` command:

```shell
agtool llm index localhost:10035/contracts contract.def 
```

Once all text has been indexed we can start asking the document questions!

In [25]:
response = AskMyDocuments(conn, 'Can we pay the consultant a bonus?', 'contract-vec')

Yes, the consultant can receive a bonus based on satisfactory services provided and actual allowable
incurred costs. Progress payments will be made monthly with a pro rata portion of the fixed fee
included. The consultant must submit progress reports with each invoice to track performance and
ensure expectations are met.


In [26]:
response.proof()

0 0.7976736 <http://franz.com/2._E.>
 E. Progress payments will be made no less than monthly in arrears based on satisfactory services
provided and actual allowable incurred costs. A pro rata portion of the CONSULTANTs fixed fee, if
applicable, will be included in the monthly progress payments. If CONSULTANT fails to submit the
required  Page 2 deliverable items according to the schedule set forth in the Scope of Services, the
COMMISSION may delay payment and/or terminate this Agreement in accordance with the provisions of
Section 4 of this Agreement.

1 0.7959859 <http://franz.com/1._D._1)>
 1) The CONSULTANT shall submit written progress reports with each invoice. The report should be
sufficiently detailed for the Contract Manager to determine if the CONSULTANT is performing to
expectations or is on schedule; to provide communication of interim findings; and to sufficiently
address any difficulties or special problems encountered, so remedies can be developed.



In [27]:
response.add_evidence_to_graph()

![ask my contract](images/ask_my_contract.png)

In [29]:
response = AskMyDocuments(conn, "What should the consultant submit with each invoice?", "contract-vec")

The consultant should submit written progress reports with each invoice, detailing performance,
communication of findings, addressing difficulties, and special problems encountered.


In [30]:
response = AskMyDocuments(conn, "A third party sued the contractor and tried to collect money from the city.", "contract-vec")

The contractor is liable to indemnify, defend, and hold harmless the city from any legal actions
brought by third parties seeking to collect money from the city.


In [31]:
response.proof()

0 0.77626973 <http://franz.com/5.>
 5. INDEMNIFICATION FOR DAMAGES, TAXES AND CONTRIBUTIONS. CONSULTANT shall exonerate, indemnify,
defend, and hold harmless the COMMISSION (which for the purpose of this Agreement shall include,
without limitation, its officers, agents, employees and volunteers) from and against:

1 0.7704964 <http://franz.com/4._B.>
 B. COMMISSION may terminate this Agreement for CONSULTANT's default if a federal or state
proceeding for the relief of debtors is undertaken by or against CONSULTANT, or CONSULTANT's  Page 3
principal, or if CONSULTANT or CONSULTANT's principal makes an assignment for the benefit of
creditors, or if CONSULTANT breaches any term(s) or violates any provision(s) of this Agreement and
does not cure such breach or violation within ten (10) days after written notice thereof by
COMMISSION. CONSULTANT shall be liable for any and all reasonable costs incurred by COMMISSION as a
result of such default, including but not limited to reprocurement cos