# Semantic search index creation 

Anton Antonov  
April 2025  

-----

## Introduction

This notebook shows how to create an LLM-computed vector database over the paragraphs of a relatively large collection of texts.

Here is the Retrieval Augmented Generation (RAG) workflow we consider:

- The document collection is ingested.
- The documents are split into chunks of relevant sizes.
- Large Language Model (LLM) embedding vectors are obtained for all chunks.
- A vector database is created with these embedding vectors and stored locally. Multiple local databases can be created.
- A relevant local database is imported for use.
- An input query is provided to a retrieval system.
- The retrieval system retrieves relevant documents based on the query.
- The top K documents are selected for further processing.
- The selected documents are added to the LLM prompt as context for the query.
- The LLM generates an answer based on the query and that context.
- The output answer is presented to the user.
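
The retrieval and top-K selection steps above can be sketched with core Raku only. This is a minimal illustration, not the implementation used by "LLM::RetrievalAugmentedGeneration": the vectors are made-up toy values, and the names `cosine-similarity` and `top-k` are hypothetical.

```raku
# Toy embedding vectors standing in for LLM embeddings (made-up values)
my %toy-vectors =
    'doc-a.0' => [0.9, 0.1, 0.0],
    'doc-a.1' => [0.7, 0.6, 0.1],
    'doc-b.0' => [0.0, 0.2, 0.9],
    'doc-b.1' => [0.1, 0.9, 0.3];

# Cosine similarity of two numeric vectors of equal length
sub cosine-similarity(@a, @b) {
    my $dot    = [+] @a Z* @b;
    my $norm-a = sqrt([+] @a.map(* ** 2));
    my $norm-b = sqrt([+] @b.map(* ** 2));
    $dot / ($norm-a * $norm-b)
}

# Keys of the $k stored vectors that are most similar to the query vector
sub top-k(%db, @query, Int :$k = 2) {
    %db.sort({ -cosine-similarity(.value, @query) }).head($k)».key
}

# A toy query vector; in the real workflow it is the embedding of the query text
my @query = 0.8, 0.3, 0.0;
say top-k(%toy-vectors, @query, k => 2);    # (doc-a.0 doc-a.1)
```

For databases much larger than this toy one, a dedicated nearest neighbors structure is usually preferable to sorting all items for each query.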

(Mermaid-JS component diagram of the Retrieval Augmented Generation (RAG) workflow.)

-------

## Setup

In [1]:
use Data::Importers;
use LLM::Functions;
use XDG::BaseDirectory :terms;

use LLM::RetrievalAugmentedGeneration;
use LLM::RetrievalAugmentedGeneration::VectorDatabase;

use Data::Reshapers;
use Data::Summarizers;
use JSON::Fast;

In [2]:
my %h = num-type => num32.^name;

to-json(%h)

{
  "num-type": "num32"
}

------

## Ingest text

Get the file names of the texts to ingest:

In [19]:
my @dirParts = <DialogueWorks/Wolff-Hudson GlennDiesen/Wolff-Hudson GlennDiesen/Sachs RobinsonErhardt/Wolff-Hudson>;

my @fileNames;
for @dirParts -> $p {
    my $dirName = ($*CWD ~ '/texts/' ~ $p).subst('/notebooks/Jupyter');
    @fileNames .= append(dir($dirName).grep(*.ends-with('.txt')));
}

@fileNames.elems

76

Ingest the texts:

In [20]:
my %texts = @fileNames.map({ $_.basename => slurp($_)});
deduce-type(%texts)

Assoc(Atom((Str)), Atom((Str)), 76)

Show text sizes summary:

In [21]:
sink records-summary(%texts.values.deepmap(*.&text-stats)».Hash)

+-----------------------+----------------------+------------------------+
| words                 | lines                | chars                  |
+-----------------------+----------------------+------------------------+
| Min    => 1162        | Min    => 1          | Min    => 6498         |
| 1st-Qu => 5554.5      | 1st-Qu => 1          | 1st-Qu => 30211.5      |
| Mean   => 8286.460526 | Mean   => 217.776316 | Mean   => 44237.315789 |
| Median => 7690.5      | Median => 1          | Median => 41127        |
| 3rd-Qu => 9411        | 3rd-Qu => 1          | 3rd-Qu => 50060        |
| Max    => 33124       | Max    => 4760       | Max    => 175186       |
+-----------------------+----------------------+------------------------+
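
The median of one line per text suggests that most transcripts are stored as a single long line, so line breaks cannot be relied on for splitting. Here is a minimal sketch of how such statistics can be computed with core Raku methods. The name `simple-text-stats` is hypothetical; it is not the `text-stats` function used above and may count differently.

```raku
# Count words, lines, and characters of a text with core Raku methods
sub simple-text-stats(Str:D $text) {
    %( words => $text.words.elems,
       lines => $text.lines.elems,
       chars => $text.chars )
}

# A transcript stored as one long line
say simple-text-stats("hi everybody today we talk about the stock market");

# A text with explicit line breaks
say simple-text-stats("first line\nsecond line\n");
```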


Show a sample of the texts:

In [22]:
#% html
%texts.head(4)
==> { .Hash.deepmap(*.substr(0..120)) }()
==> to-html()

0,1
0lD_UrtPVpA.txt,hi everybody today's th February 13 2025 and our friends Michael Hudson and Richard Wolff are back with us welcome back g
e3qLvgyRSKc.txt,how do you see what are the main reasons of what's happening with the stock market okay um let me Begin by reminding ever
594yN8rxIJo.txt,hi everybody today is Thursday April 3r 25 and our friends Michael Hudson and Richard W are back with us welcome back tha
iS1HQq-29cU.txt,hi everybody today is Thursday February 27 2025 and our friends Richard wol and Michael Hudson are back with us welcome b


------

## Make vector database

**Remark:** The vector database can be made by just specifying the directory with text files. Here we use a "low-level" approach in order to experiment with different text modifications.

Make an empty vector database object:

In [23]:
my $vdbObj = LLM::RetrievalAugmentedGeneration::VectorDatabase.new(name => 'EconomicsAI');

VectorDatabase(:id("c4cd86f8-8908-44a4-8a39-9f9e476cba05"), :name("EconomicsAI"), :elements(0), :sources(0), :precision(num64))

Make an LLM access specification:

In [38]:
#my $conf = llm-configuration("ChatGPT", model => 'text-embedding-ada-002');
#my $conf = llm-configuration("Gemini");
my $conf = llm-configuration(
    Whatever,
    name => 'LLaMA', 
    model => 'llama-embedding',
    base-url => 'http://127.0.0.1:8080',
    embedding-model => 'llama-embedding',
    embedding-function => &WWW::LLaMA::Embeddings::LLaMAEmbeddings,
    module => 'WWW::LLaMA'
    );
$conf.Hash.elems

24

Create the semantic index for the vector database object (and profile it):

In [25]:
my $tstart = now;
$vdbObj.create-semantic-search-index(%texts, method => 'by-max-tokens', max-tokens => 1024, e => $conf, :!batched):embed;
my $tend = now;
say "Time to make the semantic search index: {$tend - $tstart} seconds.";

Time to make the semantic search index: 2153.971749513 seconds.
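
To illustrate the idea behind splitting with a size limit, here is a minimal sketch that uses whitespace-separated words as a rough stand-in for LLM tokens. It is an assumption-laden illustration, not the package's `'by-max-tokens'` method: the names `chunk-by-max-words` and `chunk-texts` are hypothetical. The chunk keys follow the "file name, dot, chunk index" pattern seen in the chunk sample shown later in this notebook.

```raku
# Split a text into chunks of at most $max-words whitespace-separated words.
# (Words are only a rough stand-in for LLM tokens.)
sub chunk-by-max-words(Str:D $text, Int:D :$max-words = 5) {
    $text.words.rotor($max-words, :partial).map(*.join(' ')).List
}

# Key each chunk as "<file name>.<chunk index>"
sub chunk-texts(%texts, Int:D :$max-words = 5) {
    my %chunks;
    for %texts.kv -> $name, $text {
        for chunk-by-max-words($text, :$max-words).kv -> $i, $chunk {
            %chunks{$name ~ '.' ~ $i} = $chunk;
        }
    }
    %chunks
}

# Toy texts (made-up content)
my %toy-texts =
    'a.txt' => 'one two three four five six seven',
    'b.txt' => 'alpha beta';

.say for chunk-texts(%toy-texts, max-words => 3).sort(*.key);
```

Each chunk then gets its own embedding vector, which is why the number of vectors in the database is much larger than the number of ingested files.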


By default, the vector database object is exported to a sub-directory of [`$XDG_DATA_HOME`](https://specifications.freedesktop.org/basedir-spec/latest/index.html):

In [27]:
# The sub-directory
my $dirname = data-home.Str ~ '/raku/LLM/SemanticSearchIndex';

# The exported vector database base file name
my $basename = "SemSe-{$vdbObj.id}.json";

# Corresponding IO::Path object
my $file = IO::Path.new(:$dirname, :$basename);

# Check for existence
$file.f

True

The export path is saved in the vector database object:

In [28]:
$file.Str eq $vdbObj.location

True

Show a sample of the text chunks:

In [29]:
#% html
$vdbObj.items.pairs.pick(4).sort(*.key) ==> to-html()

0,1
FSA6l97_g8k.txt.6,countries that are and are that are now subject to the ins sanctions that the American government itself has been putting on now how do you reconcile the fact that the American sanctions that are imposed by the neoliberal the neocons and the neoliberals are a against the profit uh search by the leading American uh sectors Information Technology sectors uh the you could go right to the car manufacturers all the others you could say that the the the sanctions end up penalizing the US economy much more than other countries because while other countries have a short-term Interruption of their supply they have long-term Independence and for America this uh long-term effect and even the short-term effect is to uh take away from the American exporters uh the the leading industrial sectors certainly on the stock exchange to take away this Market it's lost so what the Americans are doing is isol self ol ating themselves uh we all thought for years that somehow the global majority was going to uh get together and draft a means of becoming independent and helping their own economic interests but it's the United States that is driving this uh ironically not uh China not Russia uh not these other countries they're reacting uh to the US that is uh essentially committing policies that are economically suicidal yeah well I've noticed that too if you read for example the the statements periodically put out by the United States Chamber of Commerce you get what Michael is talking about they're very ner they don't want this fighting with China they they represent large number of Corporations who have put large amounts of investment inside China they don't want to lose those uh China is the biggest fastest growing Market in the world nobody wants to be excluded every business school teaches you want to make a lot of money you go to where the wages are cheap and the market is growing hello that's these other parts of the world that's where all that is going on and and 
that's going to outcompete the West sooner or later they say all of that so Michael's question stands what's going on and here's the best that I can do I'm guessing and I'm hoping you or your audience sets me if I'm making a mistake they really don't see what we're talking about in other words when I said earlier a bit mockingly they they live in the 1960s and 70s when the dominance of the United States was real maybe there's more truth to that than my mockery would leave anyone to understand that they really do believe that this is a temporary momentary challenge which

0,1
Uz5-PtkUw9s.txt.11,do is to look at China and no matter how many times the Chinese tell you we have two goals by the way they've been saying this for 50 years we have two goals number one to end a 100 Years of humiliation by which they mean colonialism because even though China as a whole never became a colony parts of it did the the cities along the coast were taken over some by the Germans some by the British it was horrible and they they fought the box of rebellion and they were defeated and all the rest the second goal of China was to raise its people out of the worst poverty the world has ever seen two goals not to be humiliated by foreigners and to raise their standard of living basically okay that's what they set out to do and they have been the most successful in doing that in the history of the world if you measure the amount of improvement and the time it took to achieve it by those standards they are a roaring success notice I'm not commenting on their internal civil liberties or a whole lot of other qualities that are another conversation the chin but for the United States it cannot see what they're doing or why they're doing it they don't anymore have the link of a great struggle between capitalism and socialism cuz that really doesn't fit anymore so they have it between democracy and authoritarianism which has no more pull or power of analysis than the old capitalism versus socialism ever did these are ways of handling the rationalization that the United States needs to achieve what for it has become security if you become a world power then security requires you to control the world if you don't want to be worried about the rest of the world and don't be as a world power be a real strong power where the hell you are but the United States has 7 to 800 bases around the country that's the aspirations of a world power and now it has the problem how do you rationalize wanting to be perpetually what no Empire has achieved answer everybody else is a threat 
to all that is good in the world it is either nonhuman or a real bad civilization or author Arian last point the the irony here which either a Hegel as philosopher or a barold brush uh as a theater writer or a George Carlin as a comedian how you need that level of Brilliance to capture the most authoritarian political structure exists inside every capitalist Corporation the CEO tells everybody else what to do and the people he orders about the employees have absolutely no recall over him whatsoever they don't vote for him they don't approve anything he does if they don't if

0,1
hzIft13pASY.txt.0,let's start with your latest article on Empire Decline and costly delusions you talk about this conflict in Ukraine and the complexity of this issue that we witnessing with Russia the West NATO and could you elaborate on this article because I found it so important for our audience well first of all thank you for your kind words about it um I wrote it because of my concern that the level of miscalculations involved in this war have to be explained I mean there are many aspects of this war and I know you have explored them with various guests and and you will continue to do so and that that's an important service but I want to uh pose and answer a question question and the question is there were terrible miscalculations made and I want to uh explore them let me Begin by explaining what I mean um early in the war and even before the War uh Russia had indicated that it considered this a threat to its security and um safety the possibility that Ukraine would join the economic Union the European Union and also NATO uh and then when they made the decision uh to invade which they did in February of uh of 2022 we know that there was um there were statements there were meetings and there were negotiations early in the war to bring it to an end before it could do terrible damage either to Ukraine or to Russia or to both of them uh but those failed and we know that there was a a role played by British prime minister Boris Johnson who went to Kiev and persuaded zalinski or at least that's the best sense I can make of it not to sign an agreement not to work something out uh with the Russians um okay then the war begins in Earnest and I know from the United States and from others countries too that the following miscalculations were made and articulated number one that Russia was weak and could be defeated militarily even by a much smaller country Ukraine if it got support from Western Europe and America number two that one of the objectives was to bring I 
quote bring Russia to its knees and or quote break Russia up into smaller uh National units like Ukraine for example and so on uh then the calculation was made that the Russian Ruble would collapse then the calculation was made that the Russian economy would collapse it could not sustain for any length of time uh a war I could go on but these were all shown to be huge miscalculations why because Russia showed that it can wage a war of attrition that it can do so better than the Ukraine and it can do it better than the Ukraine with Western Arms and Western financial support

0,1
qcTLAX8hF7I.txt.3,"other ideologies of the 20th century that were based on superiority of race or religion uh or some other attribute of European culture that gave the right to dominate. And this is where uh so much goes wrong uh in our uh thinking. Uh these beliefs become deeply embedded. Uh maybe not explicit, maybe they're even denied. uh after a certain point, but they're embedded uh in the way that our governments, our states approach international issues. Was it Emanuel's K's 300y year birthday last year in Khalinrad and it did strike me some of the universalism behind it. There was some sense of superiority. But what's interesting with Samuel Hunting, he made this point that uh um that the western countries we we tend to believe that we've been ruling the world for these centuries because of our superior ideas and values and ideology. But he made a point that it's really the mastered more efficient weaponry, controlled the sea lanes and got a head start in the industrial revolution. And yeah, the rest of the world do not uh do do not ignore this that it was our organized violence. But uh but how but it begs the question how will this lead to a different rule by China? Because if if the euroentric world was you all these empires on a small continent um how this kind of formed our way of looking at the world because our political theorists tend to assume that you have this unavoidable geopolitical rivalry that is almost a law of nature. How do you think China would be different? its geopolitical mentality would be different than for example the way the Europeans have behaved because we always assumed it's all universal. 
This is also an absolutely fascinating and much debated question I would say but I have my own views which I'll share and that is that the decisive event in western political culture if I may put it that way is 476 AD uh which is the collapse of the western Roman Empire when Rome is conquered by German Germanic conquerors and at that point it was a long process but with the fall of the western Roman Empire Europe uh fragmented into multiple political entities in fact a complete complex uh remarkable kaleidoscope of political entities of citystates and and kingdoms and dupdoms and uh every conceivable form of uh political organization from the Chinese level up through would be Europeanwide empires that never quite reached their billing like the Holy Roman Empire that Charlemagne in in effect began. But it, as uh was famously said, it was never quite holy, never Roman, and never an empire uh in in the way that it"


Show dimensions and data type of the obtained vectors:

In [30]:
say "dimensions : ", $vdbObj.vectors.&dimensions;
say "data type  : ", deduce-type($vdbObj.vectors);

dimensions : (1352 2048)
data type  : Assoc(Vector(Atom((Str)), 1352), Tuple([(Any) => 1352], 1352), 1352)
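
Here is a minimal sketch of what these dimensions mean, together with a back-of-the-envelope estimate of the size of the raw numbers for the two floating point precisions mentioned in this notebook (`num64` and `num32`). The helper `vector-dimensions` and the toy vectors are hypothetical. The exported JSON file stores numbers as text, so its size on disk will differ.

```raku
# Toy stand-in for $vdbObj.vectors: chunk keys mapped to equal-length vectors
my %toy-vectors =
    'a.txt.0' => [0.1, 0.2, 0.3],
    'a.txt.1' => [0.4, 0.5, 0.6];

# (number of vectors, vector length), assuming all vectors have the same length
sub vector-dimensions(%v) {
    (%v.elems, %v.values.head.elems)
}

say vector-dimensions(%toy-vectors);    # (2 3)

# Raw size of 1352 vectors of length 2048 for each precision
my ($n, $dim) = 1352, 2048;
for ('num64' => 8, 'num32' => 4) -> $p {
    my $bytes = $n * $dim * $p.value;
    say "{$p.key}: $bytes bytes, about {($bytes / 1e6).fmt('%.1f')} MB";
}
```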


-----

## Summaries

Scan the vector databases in the default directory and show their summaries:

In [34]:
#% html
my @field-names = <id name item-count dimension version llm-service llm-embedding-model created>;
vector-database-objects(f=>'hash', :flat)
==> { .head(3) }()
==> { $_.map({ $_<created> = $_<file>.IO.created.DateTime.Str.subst('T',' ').substr(^19); $_}).sort(*<created>).reverse }()
==> to-html(:@field-names)

id,name,item-count,dimension,version,llm-service,llm-embedding-model,created
c4cd86f8-8908-44a4-8a39-9f9e476cba05,EconomicsAI,1352,2048,0,llama,llama-embedding,2025-05-03 23:00:23
e8042e57-9fea-43fe-b603-b4f4a31a67e1,FSMComands,35,1536,0,chatgpt,text-embedding-3-small,2024-10-02 03:16:43
5097c865-7fad-43d9-b7ca-456220d754b7,words,20,768,0,gemini,embedding-001,2024-09-18 14:08:44


-------

## References

### Articles

[AA1] Anton Antonov, 
["Outlier detection in a list of numbers"](https://rakuforprediction.wordpress.com/2022/05/29/outlier-detection-in-a-list-of-numbers/),
(2022),
[RakuForPrediction at WordPress](https://rakuforprediction.wordpress.com).

### Packages

[AAp1] Anton Antonov,
[WWW::OpenAI Raku package](https://github.com/antononcube/Raku-WWW-OpenAI),
(2023),
[GitHub/antononcube](https://github.com/antononcube).

[AAp2] Anton Antonov,
[WWW::LLaMA Raku package](https://github.com/antononcube/Raku-WWW-LLaMA),
(2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp3] Anton Antonov,
[LLM::Functions Raku package](https://github.com/antononcube/Raku-LLM-Functions),
(2023-2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp4] Anton Antonov,
[LLM::Prompts Raku package](https://github.com/antononcube/Raku-LLM-Prompts),
(2023-2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp5] Anton Antonov,
[ML::FindTextualAnswer Raku package](https://github.com/antononcube/Raku-ML-FindTextualAnswer),
(2023-2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp6] Anton Antonov,
[Math::Nearest Raku package](https://github.com/antononcube/Raku-Math-Nearest),
(2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp7] Anton Antonov,
[Math::DistanceFunctions Raku package](https://github.com/antononcube/Raku-Math-DistanceFunctions),
(2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp8] Anton Antonov,
[Statistics::OutlierIdentifiers Raku package](https://github.com/antononcube/Raku-Statistics-OutlierIdentifiers),
(2022),
[GitHub/antononcube](https://github.com/antononcube).

### Video channels

[GDc1] Glenn Diesen,
["The Greater Eurasia Podcast"](https://www.youtube.com/@GDiesen1),
(2011-2025).

[NAc1] Nima Alkhorshid,
["Dialogue works"](https://www.youtube.com/@dialogueworks01),
(2021-2025).

[REc1] Robinson Erhardt,
["Robinson's Podcast"](https://www.youtube.com/@robinsonerhardt),
(2022-2025).