# Self-Generative Systems (SGS) with an Integrated Large Language Model (LLM)

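This demonstration presents an efficient and unobtrusive way to integrate AI models, specifically large language models (LLMs), into the SGS framework in a plug-and-play fashion. A key feature of the integration is a shared tokenization process that feeds both the Metadata Model (MM) and the LLM. Because both components receive the same input, they operate on the same underlying data, which keeps the system efficient and coherent.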

The solution involves extending the tokenizer of a specific LLM, in our case GPT2Tokenizer.

In ExtendedGPT2Tokenizer, two methods are responsible for these tasks:

- `tokenize_text(text)`: tokenizes the input text.
- `update_tensors(token_ids)`: updates the tensors that hold the metamodel's HllSets data, based on the token IDs.
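
To make the idea of folding token IDs into HllSet data more concrete, here is a minimal sketch in plain Python. It is not the implementation of `ExtendedGPT2Tokenizer`: the function names `token_hash` and `update_registers`, the use of SHA-1 truncated to 32 bits, and the flat list of registers are all assumptions chosen for illustration. The only detail taken from the notebook is the precision parameter `p`, which here selects `2**p` registers in the usual HyperLogLog manner.

```python
import hashlib
from typing import List


def token_hash(token_id: int) -> int:
    """Hash a token ID to a 32-bit integer (SHA-1 is an illustrative choice)."""
    digest = hashlib.sha1(str(token_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def update_registers(registers: List[int], token_ids: List[int], p: int = 4) -> List[int]:
    """Fold token IDs into 2**p HyperLogLog-style registers (hypothetical helper).

    The top p bits of each hash pick a register. The register keeps the
    largest rank seen, where rank is the position of the first 1-bit in
    the remaining bits.
    """
    if len(registers) != (1 << p):
        raise ValueError("expected 2**p registers")
    rest_bits = 32 - p
    for token_id in token_ids:
        h = token_hash(token_id)
        index = h >> rest_bits
        rest = h & ((1 << rest_bits) - 1)
        # Leading zeros within rest_bits, plus one.
        rank = rest_bits - rest.bit_length() + 1
        registers[index] = max(registers[index], rank)
    return registers


# Usage: token IDs would normally come from the tokenizer.
registers = update_registers([0] * 16, [2215, 262, 5253], p=4)
print(registers)
```

Because each register only ever takes a maximum, feeding the same token IDs twice leaves the registers unchanged, which is what lets the same tokenization pass serve both the LLM and the metadata side.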


In [None]:
using PyCall
using DataFrames

# Import the extended tokenizer and the GPT-2 classes from the local SGS_Tokenizer Python module
py"""
import sys
sys.path.append(".")
from SGS_Tokenizer import ExtendedGPT2Tokenizer, GPT2Tokenizer, GPT2LMHeadModel
"""

function tensor_to_array(tensor::PyObject)
    # Convert the Python tensor (via its NumPy array) to a Julia Vector{Int64}
    hllset_array = pycall(tensor.numpy, PyArray)
    hllset_vector = Vector{Int64}(hllset_array)

    return hllset_vector
end


### Tensor update process

1. Get the token IDs:
```python
token_ids = tokenizer.tokenize_text(text)
```

2. Update the tensors:
```python
new_tensor_1, double_value = tokenizer.update_tensors(token_ids)
```

With only these two calls, the AI model plugs into the SGS framework without changes to the rest of the pipeline.

In [None]:
text = "When the distance between two unit-length vectors is defined to be the length of their vector difference then"

vocab_file = "JLD2/vocab.json"      # Path to the vocab file
merges_file = "JLD2/merges.txt"     # Path to the merges file

tokenizer = py"ExtendedGPT2Tokenizer"(vocab_file, merges_file, p=4)

# Update tensors
token_ids = tokenizer.tokenize_text(text)
new_tensor_1, double_value = tokenizer.update_tensors(token_ids)

# println("new_tensor_1:", new_tensor_1)
# println("double_value:", double_value)

id, sha1, hll_tensor = tokenizer.tensor_to_hlltensor(new_tensor_1)
println("HLLSet:", id, "; ", sha1, "; ", hll_tensor)

hll_vector = tensor_to_array(hll_tensor)

# println("HLLSet (Vector{Int64}):", hll_vector)

tensor_slice = tokenizer.hlltensor_to_tensor(hll_tensor)
println("Tensor Slice:", tensor_slice)

# tokenizer.print_tensor_1(tensor=tokenizer.tensor_1)

In [None]:
tokenizer.print_tensors()

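At this stage, the text has been integrated into both the large language model (LLM) and the Metadata Model (MM), so the tokens in both models can be searched and retrieved.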
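
One way to keep tokens retrievable from both sides is to record a hash-to-token-ID map while tokenizing. The sketch below illustrates that idea only. The helper names `build_hash_index` and `lookup_token_ids` are hypothetical, and the SHA-1 based `token_hash` is an assumed stand-in for whatever hash the tokenizer actually uses.

```python
import hashlib
from typing import Dict, Iterable, List


def token_hash(token_id: int) -> int:
    """Hash a token ID to a 32-bit integer (illustrative choice of hash)."""
    digest = hashlib.sha1(str(token_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def build_hash_index(token_ids: Iterable[int]) -> Dict[int, int]:
    """Map each token hash back to the token ID that produced it."""
    return {token_hash(token_id): token_id for token_id in token_ids}


def lookup_token_ids(index: Dict[int, int], hashes: Iterable[int]) -> List[int]:
    """Return the token IDs for the hashes that are present in the index."""
    return [index[h] for h in hashes if h in index]


# Usage: hashes found on the metadata side lead back to LLM token IDs.
index = build_hash_index([2215, 262, 5253])
print(lookup_token_ids(index, [token_hash(262)]))
```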

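To retrieve all HllSets in the MM that are related to a query, we follow these steps:

1. We build an HllSet from the query text, using the same tokenization process as before.
2. We then search the MM for all HllSets similar to the query HllSet, using cosine similarity to identify those that meet a specified threshold.
3. Finally, using the mapping from token hashes to token IDs, we query the LLM with the retrieved token IDs. The resulting tokens represent a set of related tokens.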
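
The threshold search in step 2 can be sketched as follows. This is a minimal illustration, assuming each HllSet is available as a plain vector of register values keyed by an identifier. The names `cosine_similarity` and `search_hllsets` are hypothetical and do not describe the signature of `tokenizer.search`.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_hllsets(
    store: Dict[str, Sequence[float]],
    query_vector: Sequence[float],
    threshold: float,
) -> List[str]:
    """Return the keys of stored vectors whose similarity to the query meets the threshold."""
    return [
        key
        for key, vector in store.items()
        if cosine_similarity(query_vector, vector) >= threshold
    ]


# Usage with toy register vectors (made-up values, not real HllSets).
store = {"a": [1, 0, 2], "b": [0, 3, 0]}
print(search_hllsets(store, [1, 0, 2], threshold=0.9))  # ['a']
```

A low threshold such as the `0.1` used in this notebook admits loosely related HllSets; raising it narrows the result to near matches.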

In [None]:
# Perform search
query = "When the distance between two unit-length vectors is defined"

threshold = 0.1
related_hllsets = pycall(tokenizer.search, PyObject, query, threshold)
println("Related HLL sets: ", related_hllsets)

# Get related tokens
related_tokens = pycall(tokenizer.get_related_tokens, PyObject, related_hllsets)
# println("Related tokens: ", related_tokens)

We can now ask the LLM for suggestions that are consistent with the related tokens we extracted.

In [None]:
# Generate meaningful text
suggestions = []
try    
    suggestions = pycall(tokenizer.generate_text, PyObject, related_tokens, 3)
catch e
    println("Error generating text suggestions: ", e)
end


println(tokenizer.format_generated_texts(suggestions))

# tokenizer.print_tensors()

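We assume that the HllSets in the metamodel are organized in some way. A common approach is to group HllSets into related communities using a technique such as cosine similarity. This lets us evaluate the generated suggestions and select the ones most relevant to our interests.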
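
As a rough illustration of this evaluation step, the sketch below scores each suggestion against a set of communities and ranks the suggestions by their best match. It assumes every suggestion and every community is represented by a plain vector of register values. The function `rank_suggestions` and this data layout are hypothetical and may differ from what `tokenizer.evaluate_generated_texts` does.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_suggestions(
    suggestions: Dict[str, Sequence[float]],
    communities: Dict[str, Sequence[float]],
) -> List[Tuple[str, str, float]]:
    """Pair each suggestion with its closest community, best score first."""
    if not communities:
        return []
    ranked = []
    for text, vector in suggestions.items():
        best_id, best_score = max(
            ((cid, cosine(vector, cvec)) for cid, cvec in communities.items()),
            key=lambda pair: pair[1],
        )
        ranked.append((text, best_id, best_score))
    ranked.sort(key=lambda row: row[2], reverse=True)
    return ranked


# Usage with toy vectors (made-up values, not real HllSets).
suggestions = {"first suggestion": [1, 0, 0], "second suggestion": [1, 1, 0]}
communities = {"community-x": [1, 0, 0], "community-y": [0, 0, 1]}
for text, community_id, score in rank_suggestions(suggestions, communities):
    print(text, community_id, round(score, 3))
```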

In [None]:
# Evaluate generated texts
# Load or define your communities of HLL sets here
communities = [
    "4da9de3e80bcb65ee6169a411b0206ce45ba68bc",
]

evaluation_results = pycall(tokenizer.evaluate_generated_texts, PyObject, suggestions, communities)
println("Evaluation results: ", tokenizer.format_generated_texts(evaluation_results))