# Self-Generative Systems (SGS) with an Integrated Large Language Model (LLM)

This presentation outlines a lightweight, unobtrusive method for integrating AI models, specifically Large Language Models (LLMs), into the SGS framework in a plug-and-play manner. The core of the integration is a shared tokenization step that supplies input to both the Metadata Model (MM) and the LLM. Because both components are built from the same token IDs, the text is tokenized only once and the two models stay consistent with each other.

The solution extends the tokenizer of the chosen LLM, which in our case is `GPT2Tokenizer`.

The resulting `ExtendedGPT2Tokenizer` adds two key methods:

- `tokenize_text(text)`: tokenizes the input text and returns the token IDs.
- `update_tensors(token_ids)`: updates the tensors that hold the Metadata Model's HllSet data, based on those token IDs.
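The role of the second method can be pictured with a minimal sketch. It is not the `ExtendedGPT2Tokenizer` code: the helper names `token_hash`, `bin_and_rank`, and `update_registers`, the SHA-1 based 32-bit hash, the trailing-zeros rank, and the bitmask registers are all assumptions chosen for illustration. Only `p = 4` is taken from this notebook.

```python
import hashlib
from typing import List, Tuple

P = 4                       # precision: 2**P registers (p=4 is used in this notebook)
NUM_REGISTERS = 1 << P
HASH_BITS = 32              # assumption: token hashes are 32 bits wide


def token_hash(token_id: int) -> int:
    """Stable 32-bit hash of a token ID (SHA-1 is an arbitrary choice here)."""
    digest = hashlib.sha1(str(token_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def bin_and_rank(h: int) -> Tuple[int, int]:
    """Split a hash into a register index (top P bits) and a rank.

    The rank is 1 + the number of trailing zero bits in the remaining bits,
    one common HyperLogLog-style convention (assumed, not confirmed).
    """
    bin_index = h >> (HASH_BITS - P)
    rest = h & ((1 << (HASH_BITS - P)) - 1)
    if rest == 0:
        return bin_index, HASH_BITS - P + 1
    return bin_index, (rest & -rest).bit_length()


def update_registers(registers: List[int], token_ids: List[int]) -> List[int]:
    """Fold token IDs into the registers, recording each rank as one bit."""
    for token_id in token_ids:
        bin_index, rank = bin_and_rank(token_hash(token_id))
        registers[bin_index] |= 1 << rank
    return registers


# Hypothetical token IDs, standing in for the output of tokenize_text(text).
token_ids = [2215, 262, 5253]
registers = update_registers([0] * NUM_REGISTERS, token_ids)
print(registers)
```

Because the update is a bitwise OR, feeding the same token IDs twice leaves the registers unchanged, which is what lets one tokenization pass serve repeated ingestion safely.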


In [1]:
using PyCall
using DataFrames

# Import the extended tokenizer and the GPT-2 classes from the local Python module SGS_Tokenizer
py"""
import sys
sys.path.append(".")
from SGS_Tokenizer import ExtendedGPT2Tokenizer, GPT2Tokenizer, GPT2LMHeadModel
"""

function tensor_to_array(tensor::PyObject)
    # Convert the Python tensor (PyObject) to a Julia Vector{Int64}
    hllset_array = pycall(tensor.numpy, PyArray)
    hllset_vector = Vector{Int64}(hllset_array)

    return hllset_vector
end

tensor_to_array (generic function with 1 method)

### Tensor Update Process

1. Obtain token IDs:

```python
token_ids = tokenizer.tokenize_text(text)
```

2. Update tensors:

```python
new_tensor_1, double_value = tokenizer.update_tensors(token_ids)
```

With these two calls, the same token IDs that the LLM consumes also update the Metadata Model, so no separate ingestion path is needed.

In [2]:
text = "When the distance between two unit-length vectors is defined to be the length of their vector difference then"

vocab_file = "JLD2/vocab.json"      # Path to the vocab file
merges_file = "JLD2/merges.txt"     # Path to the merges file

tokenizer = py"ExtendedGPT2Tokenizer"(vocab_file, merges_file, p=4)

# text = "When the distance between two unit-length vectors is defined to be the length of their vector difference then"

# Update tensors
token_ids = tokenizer.tokenize_text(text)
new_tensor_1, double_value = tokenizer.update_tensors(token_ids)

# println("new_tensor_1:", new_tensor_1)
# println("double_value:", double_value)

id, sha1, hll_tensor = tokenizer.tensor_to_hlltensor(new_tensor_1)
println("HLLSet:", id, "; ", sha1, "; ", hll_tensor)

hll_vector = tensor_to_array(hll_tensor)

# println("HLLSet (Vector{Int64}):", hll_vector)

tensor_slice = tokenizer.hlltensor_to_tensor(hll_tensor)
println("Tensor Slice:", tensor_slice)

# tokenizer.print_tensor_1(tensor=tokenizer.tensor_1)

HLLSet:1; 4da9de3e80bcb65ee6169a411b0206ce45ba68bc; PyObject tensor([ 2, 18, 12,  2,  2, 28,  0,  8,  0, 22,  0,  0,  0,  2,  0,  2])
Tensor Slice:PyObject tensor([[1.0220e+03, 4.0546e+07, 1.0000e-01],
        [1.2000e+01, 1.6594e+08, 1.0000e-01],
        [7.3400e+02, 3.3153e+08, 1.1000e+00],
        [3.0104e+04, 5.2051e+08, 1.1000e+00],
        [5.1100e+02, 4.0169e+08, 1.1000e+00],
        [2.6200e+02, 3.9209e+08, 1.4000e+00],
        [2.6200e+02, 3.9209e+08, 1.4000e+00],
        [3.1800e+02, 5.7578e+08, 2.2000e+00],
        [3.5800e+03, 6.3241e+08, 2.3000e+00],
        [7.8800e+02, 9.3641e+08, 3.1000e+00],
        [4.1290e+03, 1.1767e+09, 4.1000e+00],
        [2.8400e+02, 1.3680e+09, 5.2000e+00],
        [4.3260e+03, 1.4954e+09, 5.3000e+00],
        [1.5879e+04, 1.5664e+09, 5.4000e+00],
        [3.0700e+02, 1.8849e+09, 7.3000e+00],
        [5.2530e+03, 2.4650e+09, 9.1000e+00],
        [1.3664e+04, 2.6029e+09, 9.2000e+00],
        [5.4470e+03, 2.5253e+09, 9.4000e+00],
        [2.2150e
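The printed rows are consistent with one particular reading of the data, sketched below. This is an inference from the numbers shown, not documented behavior: the third column is assumed to encode `register index + rank / 10`, and each HLLSet register is assumed to be a bitmask with bit `rank` set. The values are copied from the rows above; the output is truncated, so later rows are missing.

```python
# Third column of the tensor slice rows shown above, in order.
encoded = [0.1, 0.1, 1.1, 1.1, 1.1, 1.4, 1.4, 2.2, 2.3, 3.1,
           4.1, 5.2, 5.3, 5.4, 7.3, 9.1, 9.2, 9.4]

registers = [0] * 16          # p = 4 gives 2**4 registers

for value in encoded:
    # Assumed encoding: integer part = register index, first decimal = rank.
    bin_index, rank = divmod(round(value * 10), 10)
    registers[bin_index] |= 1 << rank

print(registers[:10])
# [2, 18, 12, 2, 2, 28, 0, 8, 0, 22]
```

Under these assumptions the first ten registers match the first ten values of the printed HLLSet tensor. The non-zero values at the end of that tensor would come from rows that are cut off in the output.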

In [3]:
tokenizer.print_tensors()

At this stage, the text has been ingested into both the Large Language Model (LLM) and the Metadata Model (MM), so its tokens can be searched and retrieved through either model.

To retrieve all HllSets related to the query from the MM, we follow these steps:

1. We create an HllSet based on the query text, utilizing the same tokenization process we employed previously.
2. We then search for all HllSets in the MM that are similar to the query HllSet, using cosine similarity to identify those that meet a specified threshold.
3. Finally, using the token-hash-to-token-ID mapping, we recover the token IDs behind the matching HllSets and query the LLM with them. The tokens returned form the collection of tokens related to the query.
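Step 2 can be sketched as a plain cosine-similarity filter over register vectors. The function names, the dictionary layout, and the example vectors below are hypothetical; the actual `tokenizer.search` may represent and compare HllSets differently.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_similar(query: Sequence[float],
                   stored: Dict[int, Sequence[float]],
                   threshold: float) -> List[int]:
    """Return the IDs of stored vectors whose similarity to the query meets the threshold."""
    return [set_id for set_id, vector in stored.items()
            if cosine_similarity(query, vector) >= threshold]


# Hypothetical register vectors keyed by HllSet ID.
stored_sets = {0: [2, 18, 12, 2], 1: [0, 0, 0, 1]}
query_vector = [2, 18, 12, 0]

print(search_similar(query_vector, stored_sets, threshold=0.1))
# [0]
```

A low threshold such as 0.1 keeps loosely related sets in the result; raising it narrows the result to near matches.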


In [4]:
# Perform search
query = "When the distance between two unit-length vectors is defined"

threshold = 0.1
related_hllsets = pycall(tokenizer.search, PyObject, query, threshold)
println("Related HLL sets: ", related_hllsets)

# Get related tokens
related_tokens = pycall(tokenizer.get_related_tokens, PyObject, related_hllsets)
# println("Related tokens: ", related_tokens)

Related HLL sets: PyObject [0]


PyObject [' between', '-', ' two', ' vectors', ' their', ' the', ' the', ' is', ' difference', ' then', ' length', ' to', ' unit', ' vector', ' be', ' distance', 'length', ' defined', 'When', ' of']

We can now request suggestions from the LLM that align with the extracted related tokens.

In [5]:
# Generate meaningful text
suggestions = []
try    
    suggestions = pycall(tokenizer.generate_text, PyObject, related_tokens, 3)
catch e
    println("Error generating text suggestions: ", e)
end


println(tokenizer.format_generated_texts(suggestions))

# tokenizer.print_tensors()


Generated text suggestions:
Suggestion 1:
-lengthWhenThis article is about a character, it is not about them. For other uses of the term "character", see Character (disambiguation).

Suggestion 2:
-lengthWhenThis article is about a character, it is not about them. For other uses of the term "character", see Character (disambiguation)

Suggestion 3:
-lengthWhenThis article is about a character, it is not about them. For other uses of the term "character", see Character.

"I'm




We assume that the HllSets in the Metadata Model are already organized in some way. One common approach is to group them into communities of related HllSets, for example by cosine similarity. Each generated suggestion can then be scored against these communities, and the most relevant suggestions selected based on our interests.


In [6]:
# Evaluate generated texts
communities = [
    "4da9de3e80bcb65ee6169a411b0206ce45ba68bc"
]  # Load or define your communities of HLL sets here

evaluation_results = pycall(tokenizer.evaluate_generated_texts, PyObject, suggestions, communities)
println("Evaluation results: ", tokenizer.format_generated_texts(evaluation_results))

Evaluation results: 
Generated text suggestions:
Suggestion 1:
('-lengthWhenThis article is about a character, it is not about them. For other uses of the term "character", see Character (disambiguation).', '4da9de3e80bcb65ee6169a411b0206ce45ba68bc', np.float64(0.21087891921820445))

Suggestion 2:
('-lengthWhenThis article is about a character, it is not about them. For other uses of the term "character", see Character (disambiguation)', '4da9de3e80bcb65ee6169a411b0206ce45ba68bc', np.float64(0.2988664113443373))

Suggestion 3:
('-lengthWhenThis article is about a character, it is not about them. For other uses of the term "character", see Character.\n\n"I\'m', '4da9de3e80bcb65ee6169a411b0206ce45ba68bc', np.float64(0.29280576695787364))
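Each result printed above is a triple of suggestion text, community hash, and similarity score. A short sketch of selecting the most relevant suggestions follows, assuming a higher score means a closer match. The labels stand in for the full suggestion texts, and `min_score` is a hypothetical cut-off, not a parameter of `evaluate_generated_texts`.

```python
community = "4da9de3e80bcb65ee6169a411b0206ce45ba68bc"

# (label, community hash, score) with the scores copied from the output above.
results = [
    ("Suggestion 1", community, 0.21087891921820445),
    ("Suggestion 2", community, 0.2988664113443373),
    ("Suggestion 3", community, 0.29280576695787364),
]

min_score = 0.25  # hypothetical relevance cut-off

# Keep suggestions at or above the cut-off, best match first.
relevant = [r for r in results if r[2] >= min_score]
ranked = sorted(relevant, key=lambda r: r[2], reverse=True)

for label, comm, score in ranked:
    print(label, comm[:8], round(score, 3))
# Suggestion 2 4da9de3e 0.299
# Suggestion 3 4da9de3e 0.293
```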


