# Semantic search index creation 

### *Guide*

Anton Antonov    
September 2024  

-----

## Introduction

This notebook shows how to create an LLM-computed vector database over the paragraphs of a relatively large text.

-------

## Setup

In [1]:
use Data::Importers;
use LLM::Configurations;
use LLM::Functions;
use XDG::BaseDirectory :terms;

use LLM::RetrievalAugmentedGeneration;
use LLM::RetrievalAugmentedGeneration::VectorDatabase;

use Data::Reshapers;
use Data::Summarizers;
use Math::Nearest;
use Math::DistanceFunctions;
use Statistics::OutlierIdentifiers;

------

## Ingest text

Ingest the transcript of the 3.5-hour discussion [CWv1]:

In [2]:
my $url = 'https://podscripts.co/podcasts/modern-wisdom/833-eric-weinstein-are-we-on-the-brink-of-a-revolution';
my $txtEN = data-import($url, 'plaintext');

text-stats($txtEN)

(chars => 245233 words => 36863 lines => 7107)

Take the "proper transcript" part:

In [3]:
my $txtEN2 = $txtEN.substr($txtEN.index('Starting point is 00:00:00'));
text-stats($txtEN2)

(chars => 242067 words => 36490 lines => 7048)

Split into paragraphs and make the paragraphs compact:

In [4]:
my @paragraphs = $txtEN2.split(/ 'Starting point is' \h+ [\d ** 2]+ % ':' /):g;
@paragraphs .= map({ $_.subst(/\n+/, "\n"):g});
@paragraphs.elems

442
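
To make the split-and-compact step concrete, here is a minimal sketch on a made-up two-segment string. The toy text, the `:skip-empty` adverb, and the `.trim` call are additions for illustration and are not part of the notebook's pipeline:

```raku
# Toy transcript with two timestamp markers (made-up text).
my $toy = "Starting point is 00:00:00 First line.\n\n\nStill first.\n"
        ~ "Starting point is 00:01:25 Second paragraph.";

# Split on the marker; :skip-empty drops the empty piece before the first marker.
my @toy-paragraphs = $toy.split(/ 'Starting point is' \h+ [\d ** 2]+ % ':' /, :skip-empty);

# Collapse runs of newlines into one and trim surrounding whitespace.
@toy-paragraphs .= map({ .subst(/\n+/, "\n", :g).trim });

say @toy-paragraphs.elems;
.raku.say for @toy-paragraphs;
```

Without `:skip-empty`, the split would also return an empty first element whenever the text begins with a marker.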

Show a sample of the paragraphs:

In [5]:
#% html
@paragraphs.pick(4) ==> to-html()

------

## Make vector database

Make an empty vector database object:

In [6]:
my $vdbObj = LLM::RetrievalAugmentedGeneration::VectorDatabase.new(name => 'No833');

VectorDatabase(:id("603118ff-5738-4328-92cc-7aa2a261714d"), :name("No833"), :elements(0), :sources(0))

Make an LLM access specification:

In [12]:
#my $conf = llm-configuration("ChatGPT", model => 'text-embedding-ada-002');
my $conf = llm-configuration("Gemini");

$conf.Hash.elems

24

Create the semantic index for the vector database object (and profile it):

In [13]:
my $tstart = now;
$vdbObj.create-semantic-search-index(@paragraphs, method => 'by-max-tokens', max-tokens => 2048, e => $conf):embed;
my $tend = now;
say "Time to make the semantic search index: {$tend - $tstart} seconds.";

Time to make the semantic search index: 211.465842025 seconds.
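
The `max-tokens => 2048` argument suggests a size cap on each text chunk. The package's actual chunking logic is not shown in this notebook. As a rough illustration only, the sketch below splits a text into pieces under a word budget. The helper `split-by-word-budget` and its `max-words` parameter are hypothetical, and counting whitespace-separated words is just a stand-in for real tokenization:

```raku
# Hypothetical helper, not part of LLM::RetrievalAugmentedGeneration.
# Tokens are approximated by whitespace-separated words; real tokenizers differ.
# Note that joining with a single space discards the original line breaks.
sub split-by-word-budget(Str $text, Int :$max-words = 2048) {
    $text.words.rotor($max-words, :partial).map(*.join(' ')).List
}

# Seven words with a budget of three give pieces of 3, 3, and 1 words.
my @pieces = split-by-word-budget('a b c d e f g', max-words => 3);
.say for @pieces;
```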


By default, the vector database object is exported to a sub-directory of [`$XDG_DATA_HOME`](https://specifications.freedesktop.org/basedir-spec/latest/index.html):

In [31]:
# The sub-directory
my $dirname = data-home.Str ~ '/raku/LLM/SemanticSearchIndex';

# The exported vector database base file name
my $basename = "SemSe-{$vdbObj.id}.json";

# Corresponding IO::Path object
my $file = IO::Path.new(:$dirname, :$basename);

# Check for existence
$file.f

True
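
For readers without `XDG::BaseDirectory` installed, the same path can be assembled by hand. This is a sketch under the assumption of a Unix-like system: it uses `$XDG_DATA_HOME` when set and otherwise falls back to `~/.local/share`, the default given by the XDG Base Directory Specification. The identifier below is a made-up placeholder for `$vdbObj.id`:

```raku
# Resolve the XDG data directory: $XDG_DATA_HOME if set and non-empty,
# otherwise the specification's default of ~/.local/share.
my $xdg-data-home = %*ENV<XDG_DATA_HOME> || ($*HOME // '.'.IO).add('.local/share').Str;

# Made-up placeholder standing in for $vdbObj.id.
my $example-id = '00000000-0000-0000-0000-000000000000';

# Assemble the expected export path.
my $example-file = $xdg-data-home.IO
        .add('raku/LLM/SemanticSearchIndex')
        .add("SemSe-{$example-id}.json");

say $example-file.Str;
say $example-file.f ?? 'found' !! 'not found';
```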

The export path is saved in the vector database object:

In [None]:
$file.Str eq $vdbObj.location

Show a sample of the text chunks:

In [39]:
#% html
$vdbObj.text-chunks.pairs.pick(4).sort(*.key) ==> to-html()

| Key | Text chunk |
|-----|------------|
| 92 | What if the idea is that an outbreak of truth and democracy would destroy NATO and the world order? Let's imagine that that would undo the markets that would spread nukes. You know, what happens if, if ending, uh, the control of social media would mean that weaponized anthrax plant plans could be spread frictionlessly. If four amino acids lead to worldwide lockdowns, |
| 177 | a lot of what you're doing is just very, very low quality. But this thing that you're doing over here, I'll vouch for it, I'll put my name behind that. That's really clever and really good. If I can't find one jewel, one gem, one positive thing to say, it's, it's for somebody else. And part of this is I'm trying to indicate, this is what criticism actually is. I'm modeling what criticism. I wasn't weak. I didn't shy away from what I considered to be the sort of uninformed or pseudoscientific or historically |
| 266 | But then if you were to see Douglas Murray and Malcolm Gladwell on stage together, or Ben Shapiro and anybody, and you go, they're able to be disagreeable so seamlessly. For me to get even 5% of the way there, I need to do the equivalent of a one rep max to ask Abigail Shrier about CBT. And that's for me just an obvious area |
| 323 | Every second of my life spent in your classrooms before college, before university is a second I want back is trauma is pain. All you did is instill in me that I'm an idiot. I'm a moron. I'm not good enough. I should go away. I'm bad. I'm aberrant. It's like, I got it. I really got idiot. I'm a moron. I'm not good enough. I should go away. I'm bad. |


Show dimensions and data type of the obtained vectors:

In [None]:
say "dimensions : ", $vdbObj.database.&dimensions;
say "data type  : ", deduce-type($vdbObj.database)
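
The cell above relies on `Data::Reshapers`. As a package-free sketch of the same kind of check, the snippet below inspects a toy database, assuming (this is an assumption, not confirmed by the notebook) that the database maps chunk identifiers to equal-length numeric vectors. The numbers are made up:

```raku
# Toy stand-in for the database: chunk ID => embedding vector (made-up numbers).
my %toy-db =
    '0' => [ 0.12, -0.40,  0.88],
    '1' => [ 0.05,  0.33, -0.21],
    '2' => [-0.70,  0.10,  0.45];

# Number of vectors and the distinct vector lengths found among them.
my $n-vectors = %toy-db.elems;
my @lengths   = %toy-db.values.map(*.elems).unique;

say "dimensions      : ({$n-vectors}, {@lengths[0]})";
say "all same length : ", @lengths.elems == 1;
```

A single distinct length confirms that every chunk was embedded into the same vector space.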

-------

## References

### Articles

[AA1] Anton Antonov, 
["Outlier detection in a list of numbers"](https://rakuforprediction.wordpress.com/2022/05/29/outlier-detection-in-a-list-of-numbers/),
(2022),
[RakuForPrediction at WordPress](https://rakuforprediction.wordpress.com).

### Packages

[AAp1] Anton Antonov,
[WWW::OpenAI Raku package](https://github.com/antononcube/Raku-WWW-OpenAI),
(2023),
[GitHub/antononcube](https://github.com/antononcube).

[AAp2] Anton Antonov,
[WWW::PaLM Raku package](https://github.com/antononcube/Raku-WWW-PaLM),
(2023),
[GitHub/antononcube](https://github.com/antononcube).

[AAp3] Anton Antonov,
[LLM::Functions Raku package](https://github.com/antononcube/Raku-LLM-Functions),
(2023-2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp4] Anton Antonov,
[LLM::Prompts Raku package](https://github.com/antononcube/Raku-LLM-Prompts),
(2023-2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp5] Anton Antonov,
[ML::FindTextualAnswer Raku package](https://github.com/antononcube/Raku-ML-FindTextualAnswer),
(2023-2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp6] Anton Antonov,
[Math::Nearest Raku package](https://github.com/antononcube/Raku-Math-Nearest),
(2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp7] Anton Antonov,
[Math::DistanceFunctions Raku package](https://github.com/antononcube/Raku-Math-DistanceFunctions),
(2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp8] Anton Antonov,
[Statistics::OutlierIdentifiers Raku package](https://github.com/antononcube/Raku-Statistics-OutlierIdentifiers),
(2022),
[GitHub/antononcube](https://github.com/antononcube).

### Videos

[CWv1] Chris Williamson,
["Eric Weinstein - Are We On The Brink Of A Revolution? (4K)"](https://www.youtube.com/watch?v=PYRYXhU4kxM),
(2024),
[YouTube/@ChrisWillx](https://www.youtube.com/@ChrisWillx).   
([transcript](https://podscripts.co/podcasts/modern-wisdom/833-eric-weinstein-are-we-on-the-brink-of-a-revolution).)