# Semantic search index creation 

### *Guide*

Anton Antonov    
September 2024  

-----

## Introduction

This notebook shows how to create an LLM-computed vector database over the paragraphs of a relatively large text.

Here is the Retrieval Augmented Generation (RAG) workflow we consider:

- The document collection is ingested.
- The documents are split into chunks of relevant sizes.
- Large Language Model (LLM) embedding vectors are obtained for all chunks.
- A vector database is created with these embedding vectors and stored locally. Multiple local databases can be created.
- A relevant local database is imported for use.
- An input query is provided to a retrieval system.
- The retrieval system retrieves relevant documents based on the query.
- The top K documents are selected for further processing.
- The model is fine-tuned using the selected documents.
- The fine-tuned model generates an answer based on the query.
- The output answer is presented to the user.
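
The retrieval and top-K selection steps above can be illustrated with a minimal sketch. It assumes cosine similarity as the measure and uses made-up three-dimensional vectors; the names `%vectors`, `cosine-similarity`, and `top-k` are hypothetical and are not part of the packages used in this notebook.

```raku
# Toy embedding vectors keyed by chunk ID (hypothetical values, not real LLM embeddings)
my %vectors =
    'chunk-1' => [1.0, 0.0, 0.0],
    'chunk-2' => [0.7, 0.7, 0.0],
    'chunk-3' => [0.0, 0.0, 1.0];

# Cosine similarity between two numeric vectors of equal length
sub cosine-similarity(@a, @b) {
    my $dot    = [+] @a Z* @b;
    my $norm-a = sqrt([+] @a Z* @a);
    my $norm-b = sqrt([+] @b Z* @b);
    return $dot / ($norm-a * $norm-b);
}

# Return the IDs of the $k chunks most similar to the query vector
sub top-k(%db, @query, Int $k) {
    %db.keys.sort({ -cosine-similarity(%db{$_}, @query) }).head($k).List
}

# A made-up query vector that points mostly along the first axis
my @query = 0.9, 0.1, 0.0;
say top-k(%vectors, @query, 2);   # (chunk-1 chunk-2)
```

In a real workflow the query vector would come from the same embedding model that produced the database vectors, so that the similarities are meaningful.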

Here is a Mermaid-JS component diagram that shows the components of the Retrieval Augmented Generation (RAG) workflow:

```mermaid
flowchart TD
    subgraph LocalVDB[Local Folder]
        A(Vector Database 1)
        B(Vector Database 2)
        C(Vector Database N)
    end
    ID[Ingest document collection]
    SD[Split Documents]
    EV[Get LLM Embedding Vectors]
    CD[Create Vector Database]
    ID --> SD --> EV --> CD

    CD -.- CArray[[CArray<br>representation]]

    CD -.-> |export| LocalVDB

    subgraph Creation
        ID
        SD
        EV
        CD
    end

    LocalVDB -.- JSON[[JSON<br>representation]]

    LocalVDB -.-> |import|D[Ingest Vector Database]
 
    D -.- CArray
    F -.- |nearest neighbors<br>distance function|CArray
    D --> E
    E[/User Query/] --> F[Retrieval]
    F --> G[Document Selection]
    G -->|Top K documents| H(Model Fine-tuning)
    H --> I[[Generation]]
    I <-.-> LLM{{LLM}}
    I -->J[/Output Answer/]
    G -->|Top K passages| K(Model Fine-tuning)
    K --> I

    subgraph RAG[Retrieval Augmented Generation]
        D 
        E
        F
        G
        H
        I
        J
        K
    end
```

In this diagram:

- Document collections are ingested and processed, and the corresponding vector databases are created.
  - LLM embedding models are used to obtain the vectors.
- Multiple vector databases are stored and maintained locally.
- A vector database from the local collection is selected and ingested.
- An input query provided by the user initiates the RAG workflow.
- The workflow then proceeds with: 
  - retrieval
  - document selection
  - model fine-tuning
  - answer generation
  - presenting the final output


-------

## Setup

In [1]:
use Data::Importers;
use LLM::Functions;
use XDG::BaseDirectory :terms;

use LLM::RetrievalAugmentedGeneration;
use LLM::RetrievalAugmentedGeneration::VectorDatabase;

use Data::Reshapers;
use Data::Summarizers;

------

## Ingest text

Ingest the transcript of the 3.5-hour discussion [CWv2]:

In [2]:
#my $url = 'https://podscripts.co/podcasts/modern-wisdom/747-eric-weinstein-why-does-the-modern-world-make-no-sense';
my $url = 'https://podscripts.co/podcasts/modern-wisdom/833-eric-weinstein-are-we-on-the-brink-of-a-revolution';
my $txtEN = data-import($url, 'plaintext');

text-stats($txtEN)

(chars => 245233 words => 36863 lines => 7107)

Keep only the transcript proper, which starts at the first timestamp marker:

In [3]:
my $txtEN2 = $txtEN.substr($txtEN.index('Starting point is 00:00:00'));
text-stats($txtEN2)

(chars => 242067 words => 36490 lines => 7048)

Split the text into paragraphs at the timestamp markers and collapse repeated newlines in each paragraph:

In [4]:
my @paragraphs = $txtEN2.split(/ 'Starting point is' \h+ [\d ** 2]+ % ':' /);
@paragraphs .= map({ $_.subst(/\n+/, "\n"):g});
@paragraphs.elems

442
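
To see how this split behaves, here is a small sketch on a made-up string (the toy text is an assumption, not part of the transcript). Because the text begins with a marker, the first element of the result is an empty string. This may be why 442 paragraphs are obtained here while the vector database built later has 441 elements, although that connection is an assumption.

```raku
# Toy transcript (made-up text) that mimics the 'Starting point is HH:MM:SS' markers
my $toy = "Starting point is 00:00:00 Hello there.\n\n\nSecond line.\nStarting point is 00:01:30 Next part.";

# Same delimiter regex as above: the marker phrase followed by a colon-separated timestamp
my @parts = $toy.split(/ 'Starting point is' \h+ [\d ** 2]+ % ':' /);

# Collapse runs of newlines into a single newline
@parts .= map({ .subst(/\n+/, "\n", :g) });

say @parts.elems;      # 3
say @parts[0] eq '';   # True
```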

Show a sample of the paragraphs:

In [5]:
#% html
@paragraphs.pick(4) ==> to-html()

------

## Make vector database

Make an empty vector database object:

In [6]:
my $vdbObj = LLM::RetrievalAugmentedGeneration::VectorDatabase.new(name => 'No833');

VectorDatabase(:id("045f467c-193f-4df6-bec3-790d6c83ca64"), :name("No833"), :elements(0), :sources(0))

Make an LLM access specification:

In [7]:
#my $conf = llm-configuration("ChatGPT", model => 'text-embedding-002');
my $conf = llm-configuration("Gemini");

$conf.Hash.elems

24

Create the semantic index for the vector database object (and time it):

In [8]:
my $tstart = now;
$vdbObj.create-semantic-search-index(@paragraphs, method => 'by-max-tokens', max-tokens => 2048, e => $conf):embed;
my $tend = now;
say "Time to make the semantic search index: {$tend - $tstart} seconds.";

Time to make the semantic search index: 202.806121507 seconds.
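
The `max-tokens` argument caps the size of each text chunk. The sketch below shows one simple way such a cap could work; it is an illustration under stated assumptions, not the package's actual implementation. The helper `chunk-by-max-tokens` is hypothetical, and whitespace-separated words stand in for LLM tokens.

```raku
# Hypothetical helper: greedily pack whitespace-separated words into chunks
# of at most $max-tokens words (words stand in for LLM tokens here)
sub chunk-by-max-tokens(Str $text, Int :$max-tokens = 2048) {
    $text.words.rotor($max-tokens, :partial).map(*.join(' ')).List
}

my @chunks = chunk-by-max-tokens('one two three four five', max-tokens => 2);
say @chunks.raku;   # ["one two", "three four", "five"]
```

Under this reading, a paragraph shorter than the cap stays in a single chunk, which is consistent with the number of chunks being close to the number of paragraphs in this notebook.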


By default, the vector database object is exported to a sub-directory of [`$XDG_DATA_HOME`](https://specifications.freedesktop.org/basedir-spec/latest/index.html):

In [13]:
# The sub-directory
my $dirname = data-home.Str ~ '/raku/LLM/SemanticSearchIndex';

# The exported vector database base file name
my $basename = "SemSe-{$vdbObj.id}.json";

# Corresponding IO::Path object
my $file = IO::Path.new(:$dirname, :$basename);

# Check for existence
$file.f

True

The export path is saved in the vector database object:

In [14]:
$file.Str eq $vdbObj.location

True

Show a sample of the text chunks:

In [15]:
#% html
$vdbObj.items.pairs.pick(4).sort(*.key) ==> to-html()

| Key | Value |
|-----|-------|
| 38 | try and see how you feel as well. Best of all, there is a no questions asked refund policy with an unlimited duration so you can buy it for as long as you well. Best of all, there is a no questions asked refund policy with an unlimited duration. So you can buy it for as long as you want, try it all. And if you do not like it for any reason, they'll give you your money back and you don't even need to return the box. That's how confident they are that you love it. Right now you can get a free sample pack |
| 263 | So how can it be the case that therapy, all therapy is bad because it allows you or causes you to focus on your yourself and your issues. But you also include in that CBT, something which is unbelievably practical and shows up as an evidence-based intervention for lots of people's disorders. Is it that, and then it's the, is it that you step in to soften the blow? So throughout that episode in particular, I had to ask these questions. And then as I watch the guests, I get to this point, which is exactly the reason |
| 353 | When I did the Terence Howard thing at Joe's request, um, it generated a lot of interest and a lot of heat. I got a ton of criticism. Why would you sit down with a pseudo scientist? You're normalizing this behavior. Terence Howard is actually playing with all sorts of geometric shapes and dualities between geometric shapes that even professional mathematicians couldn't figure out. Neil Grass-Thyssen says, I don't know where these come from. Um, I didn't know where the conversation would head. |
| 405 | of John Maynard Keynes, subsequent development influence to a large degree by a name I can't pronounce. I think that there was a lot of whose family comes from the far left, you recognize certain sorts of commonalities. I'm sure she would see them in me. Um, the democratic party is not communist. I don't think that that's right. That's the critique of many of my right-wing friends, but it is welcomed in a lot of neo-Marxian thought. |


Show dimensions and data type of the obtained vectors:

In [16]:
say "dimensions : ", $vdbObj.vectors.&dimensions;
say "data type  : ", deduce-type($vdbObj.vectors);

dimensions : (441 768)
data type  : Assoc(Vector(Atom((Str)), 441), Tuple([(Any) => 441], 441), 441)


-------

## References

### Articles

[AA1] Anton Antonov, 
["Outlier detection in a list of numbers"](https://rakuforprediction.wordpress.com/2022/05/29/outlier-detection-in-a-list-of-numbers/),
(2022),
[RakuForPrediction at WordPress](https://rakuforprediction.wordpress.com).

### Packages

[AAp1] Anton Antonov,
[WWW::OpenAI Raku package](https://github.com/antononcube/Raku-WWW-OpenAI),
(2023),
[GitHub/antononcube](https://github.com/antononcube).

[AAp2] Anton Antonov,
[WWW::PaLM Raku package](https://github.com/antononcube/Raku-WWW-PaLM),
(2023),
[GitHub/antononcube](https://github.com/antononcube).

[AAp3] Anton Antonov,
[LLM::Functions Raku package](https://github.com/antononcube/Raku-LLM-Functions),
(2023-2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp4] Anton Antonov,
[LLM::Prompts Raku package](https://github.com/antononcube/Raku-LLM-Prompts),
(2023-2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp5] Anton Antonov,
[ML::FindTextualAnswer Raku package](https://github.com/antononcube/Raku-ML-FindTextualAnswer),
(2023-2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp6] Anton Antonov,
[Math::Nearest Raku package](https://github.com/antononcube/Raku-Math-Nearest),
(2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp7] Anton Antonov,
[Math::DistanceFunctions Raku package](https://github.com/antononcube/Raku-Math-DistanceFunctions),
(2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp8] Anton Antonov,
[Statistics::OutlierIdentifiers Raku package](https://github.com/antononcube/Raku-Statistics-OutlierIdentifiers),
(2022),
[GitHub/antononcube](https://github.com/antononcube).

### Videos

[CWv1] Chris Williamson,
["Eric Weinstein - Why Does The Modern World Make No Sense? (4K)"](https://www.youtube.com/watch?v=p_swB_KS8Hw),
(2024),
[YouTube/@ChrisWillx](https://www.youtube.com/@ChrisWillx).   
([transcript](https://podscripts.co/podcasts/modern-wisdom/747-eric-weinstein-why-does-the-modern-world-make-no-sense).)

[CWv2] Chris Williamson,
["Eric Weinstein - Are We On The Brink Of A Revolution? (4K)"](https://www.youtube.com/watch?v=PYRYXhU4kxM),
(2024),
[YouTube/@ChrisWillx](https://www.youtube.com/@ChrisWillx).   
([transcript](https://podscripts.co/podcasts/modern-wisdom/833-eric-weinstein-are-we-on-the-brink-of-a-revolution).)