# Semantic search index creation 

Anton Antonov  
April 2025  

-----

## Introduction

This notebook shows how to create an LLM-computed vector database over the paragraphs of a relatively large collection of texts.

Here is the Retrieval Augmented Generation (RAG) workflow we consider:

- The document collection is ingested.
- The documents are split into chunks of relevant sizes.
- Large Language Model (LLM) embedding vectors are obtained for all chunks.
- A vector database is created with these embedding vectors and stored locally. Multiple local databases can be created.
- A relevant local database is imported for use.
- An input query is provided to a retrieval system.
- The retrieval system retrieves relevant documents based on the query.
- The top K documents are selected for further processing.
- The selected documents are added to the LLM prompt as context for the query. (No model fine-tuning is involved.)
- The LLM generates an answer based on the query and the retrieved context.
- The output answer is presented to the user.
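
The retrieval and top-K selection steps of the workflow above can be illustrated with a minimal, self-contained sketch. It assumes cosine similarity as the relevance measure and uses made-up three-dimensional vectors in place of real LLM embeddings. The names `dot-product`, `vector-norm`, `cosine-similarity`, and `top-k` are hypothetical helpers written for this illustration; they are not part of the packages used in this notebook.

```raku
# Toy "embedding" vectors keyed by chunk ID (made-up numbers, not real LLM embeddings).
my %vectors =
    'chunk-1' => [1.0, 0.0, 0.0],
    'chunk-2' => [0.8, 0.6, 0.0],
    'chunk-3' => [0.0, 1.0, 0.0],
    'chunk-4' => [0.0, 0.0, 1.0];

# Dot product and Euclidean norm of numeric vectors.
sub dot-product(@a, @b) { [+] @a Z* @b }
sub vector-norm(@a) { sqrt(dot-product(@a, @a)) }

# Cosine similarity of two vectors of equal length.
sub cosine-similarity(@a, @b) {
    dot-product(@a, @b) / (vector-norm(@a) * vector-norm(@b))
}

# IDs of the $k chunks most similar to the query vector, most similar first.
sub top-k(%vecs, @query, Int $k) {
    %vecs.keys.sort({ -cosine-similarity(%vecs{$_}, @query) }).head($k).List
}

# A made-up query vector; in a real workflow it would be the embedding of the query text.
my @query = 0.9, 0.1, 0.0;
say top-k(%vectors, @query, 2);
```

In an actual RAG pipeline the texts of the selected chunks would then be placed in the LLM prompt together with the query.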

Here is a Mermaid-JS component diagram that shows the components of performing the Retrieval Augmented Generation (RAG) workflow:

-------

## Setup

In [None]:
use Data::Importers;
use LLM::Functions;
use XDG::BaseDirectory :terms;

use LLM::RetrievalAugmentedGeneration;
use LLM::RetrievalAugmentedGeneration::VectorDatabase;

use Data::Reshapers;
use Data::Summarizers;
use JSON::Fast;

In [None]:
my %h = num-type => num32.^name;

to-json(%h)

------

## Ingest text

Get the file names of the texts:

In [None]:
my $dirName = ($*CWD ~ '/texts/DialogueWorks/Wolff-Hudson').subst('/notebooks/Jupyter');
my @fileNames = dir($dirName).grep(*.ends-with('.txt'));

$dirName = ($*CWD ~ '/texts/RobinsonErhardt/Wolff-Hudson').subst('/notebooks/Jupyter');
@fileNames.append(dir($dirName).grep(*.ends-with('.txt')));

@fileNames.elems

Ingest the texts:

In [None]:
my %texts = @fileNames.map({ $_.basename => slurp($_)});
deduce-type(%texts)

Show text sizes summary:

In [None]:
sink records-summary(%texts.values.deepmap(*.&text-stats)».Hash)

Show a sample of the texts:

In [None]:
#% html
%texts.head(4)
==> { .Hash.deepmap(*.substr(0..120)) }()
==> to-html()

------

## Make vector database

**Remark:** The vector database can be made by just specifying the directory with the text files. Here we use a "low-level" approach in order to experiment with different text modifications.

Make an empty vector database object:

In [None]:
my $vdbObj = LLM::RetrievalAugmentedGeneration::VectorDatabase.new(name => 'EconomicsAI');

Make an LLM access specification:

In [None]:
#my $conf = llm-configuration("ChatGPT", model => 'text-embedding-002');
my $conf = llm-configuration("Gemini");
#my $conf = llm-configuration('LLaMA', model => 'llama-embedding');

$conf.Hash.elems

Create the semantic index for the vector database object (and profile it):

In [None]:
my $tstart = now;
$vdbObj.create-semantic-search-index(%texts, method => 'by-max-tokens', max-tokens => 1024, e => $conf):embed;
my $tend = now;
say "Time to make the semantic search index: {$tend - $tstart} seconds.";
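
To give an idea of what splitting by a maximum token count can look like, here is a minimal sketch. It assumes that whitespace-separated words are an acceptable stand-in for tokens, which real tokenizers do not guarantee. The sub `chunk-by-max-words` is a hypothetical helper for illustration only; it is not the algorithm behind the `'by-max-tokens'` method used above.

```raku
# Hypothetical helper: split a text into chunks of at most $max-words words.
# Assumption: a "token" is approximated by a whitespace-separated word.
sub chunk-by-max-words(Str $text, Int :$max-words = 1024) {
    # rotor groups the words into lists of $max-words elements;
    # :partial keeps the final, shorter group instead of dropping it.
    $text.words.rotor($max-words, :partial).map(*.join(' ')).List
}

my $text = 'one two three four five six seven';
my @chunks = chunk-by-max-words($text, max-words => 3);

.say for @chunks;
```

A splitter of this kind ignores sentence and paragraph boundaries, so a chunk can end mid-sentence.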

By default, the vector database object is exported to a sub-directory of [`$XDG_DATA_HOME`](https://specifications.freedesktop.org/basedir-spec/latest/index.html):

In [None]:
# The sub-directory
my $dirname = data-home.Str ~ '/raku/LLM/SemanticSearchIndex';

# The exported vector database base file name
my $basename = "SemSe-{$vdbObj.id}.json";

# Corresponding IO::Path object
my $file = IO::Path.new(:$dirname, :$basename);

# Check for existence
$file.f
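
For readers without the `XDG::BaseDirectory` package at hand, the following sketch shows how the data home directory is resolved according to the XDG Base Directory specification: the value of `$XDG_DATA_HOME` is used if it is set and non-empty, otherwise `$HOME/.local/share`. The sub `xdg-data-home` is a hypothetical helper written for this illustration; it is not claimed to be how `data-home` is implemented.

```raku
# Hypothetical helper: resolve the XDG data home from an environment hash.
# Falls back to $HOME/.local/share when XDG_DATA_HOME is unset or empty.
sub xdg-data-home(%env = %*ENV) {
    my $dir = %env<XDG_DATA_HOME> // '';
    $dir ne '' ?? $dir !! (%env<HOME> // $*HOME.Str) ~ '/.local/share'
}

# The sub-directory used for the semantic search indexes (path taken from the cell above).
my $index-dir = xdg-data-home() ~ '/raku/LLM/SemanticSearchIndex';
say $index-dir;

# Examples with explicit, made-up environments:
say xdg-data-home(%( XDG_DATA_HOME => '/tmp/xdg' ));
say xdg-data-home(%( HOME => '/home/user' ));
```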

The export path is saved in the vector database object:

In [None]:
$file.Str eq $vdbObj.location

Show a sample of the text chunks:

In [None]:
#% html
$vdbObj.items.pairs.pick(4).sort(*.key) ==> to-html()

Show dimensions and data type of the obtained vectors:

In [None]:
say "dimensions : ", $vdbObj.vectors.&dimensions;
say "data type  : ", deduce-type($vdbObj.vectors);
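
As a rough illustration of what the dimensions check conveys, the sketch below counts the vectors in a toy collection and verifies that they all have the same length, without using `dimensions` or `deduce-type`. The hash `%toy-vectors` holds made-up numbers and only mimics a mapping from chunk IDs to embedding vectors; the actual structure of `$vdbObj.vectors` may differ.

```raku
# Made-up stand-in for a collection of embedding vectors keyed by chunk ID.
my %toy-vectors =
    'chunk-1' => [0.1, 0.2, 0.3],
    'chunk-2' => [0.4, 0.5, 0.6];

# The distinct vector lengths; a consistent collection has exactly one.
my @lengths = %toy-vectors.values.map(*.elems).unique;

say "number of vectors : ", %toy-vectors.elems;
say "vector length(s)  : ", @lengths.join(', ');
say "consistent        : ", @lengths.elems == 1;
```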

-----

## Summaries

Skim vector databases from the default directory and show summaries:

In [None]:
#% html
my @field-names = <id name item-count dimension version llm-service llm-embedding-model created>;
vector-database-objects(f=>'hash', :flat)
==> { $_.map({ $_<created> = $_<file>.IO.created.DateTime.Str.subst('T',' ').substr(^19); $_}).sort(*<created>).reverse }()
==> to-html(:@field-names)
