# Semantic search index creation 

### *Guide*

Anton Antonov    
September 2024  

-----

## Introduction

This notebook shows how to create an LLM-computed vector database over the paragraphs of a relatively large text.

-------

## Setup

In [1]:
use Data::Importers;
use LLM::Configurations;
use LLM::Functions;
use XDG::BaseDirectory :terms;

use LLM::RetrievalAugmentedGeneration;
use LLM::RetrievalAugmentedGeneration::VectorDatabase;

use Data::Reshapers;
use Data::Summarizers;
use Math::Nearest;
use Math::DistanceFunctions;
use Statistics::OutlierIdentifiers;

------

## Ingest text

Ingest the transcript of the 3.5-hour discussion [CWv1]:

In [2]:
my $url = 'https://podscripts.co/podcasts/modern-wisdom/833-eric-weinstein-are-we-on-the-brink-of-a-revolution';
my $txtEN = data-import($url, 'plaintext');

text-stats($txtEN)

(chars => 245233 words => 36863 lines => 7107)

Take the "proper transcript" part:

In [3]:
my $txtEN2 = $txtEN.substr($txtEN.index('Starting point is 00:00:00'));
text-stats($txtEN2)

(chars => 242067 words => 36490 lines => 7048)
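The one-liner above assumes the marker is present in the downloaded page. A minimal sketch of a guarded variant follows, using made-up text; the names `$text`, `$marker`, `$pos`, and `$body` are illustrative only and do not come from the notebook.

```raku
# Made-up page text: some site chrome, then the transcript proper.
my $text   = "Site header and navigation\nStarting point is 00:00:00 First words.";
my $marker = 'Starting point is 00:00:00';

# Str.index returns Nil when the marker is absent, so test definedness
# before handing the position to substr.
my $pos  = $text.index($marker);
my $body = $pos.defined ?? $text.substr($pos) !! $text;

say $body;   # Starting point is 00:00:00 First words.
```

If the marker is missing, this variant falls back to the full text rather than passing an undefined offset to `substr`.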

Split into paragraphs and make the paragraphs compact:

In [4]:
my @paragraphs = $txtEN2.split(/ 'Starting point is' \h+ [\d ** 2]+ % ':' /);
@paragraphs .= map({ $_.subst(/\n+/, "\n"):g});
@paragraphs.elems

442
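To see what the split and the newline compaction do, here is a small sketch on a made-up two-block transcript. The toy text, the `:skip-empty` option, and the extra `.trim` are assumptions added for illustration; the notebook's cell uses neither.

```raku
# Toy transcript in the same shape as the real one (made-up text).
my $toy = q:to/END/;
Starting point is 00:00:00 Hello and welcome.


This is the intro.
Starting point is 00:01:15 Second block
of text.
END

# Split on the timestamp markers; :skip-empty drops the empty piece
# that precedes the very first marker.
my @parts = $toy.split(/ 'Starting point is' \h+ [\d ** 2]+ % ':' /, :skip-empty);

# Collapse runs of newlines and trim surrounding whitespace.
@parts .= map({ .subst(/\n+/, "\n", :g).trim });

say @parts.elems;   # 2
.say for @parts;
```

Because the real text also begins with a marker, the first piece of the notebook's split is an empty string. That is presumably why 442 pieces are reported here while the database built later holds 441 vectors.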

Show a sample of the paragraphs:

In [5]:
#% html
@paragraphs.pick(4) ==> to-html()

------

## Make vector database

Make an empty vector database object:

In [6]:
my $vdbObj = LLM::RetrievalAugmentedGeneration::VectorDatabase.new(name => 'No833');

VectorDatabase(:id("266b20ca-d917-4ac0-9b0a-7c420625666c"), :name("No833"), :elements(0), :sources(0))

Make an LLM access specification:

In [7]:
my $conf = llm-configuration("ChatGPT", model => 'text-embedding-002');
#my $conf = llm-configuration("Gemini");

$conf.Hash.elems

24

Create the semantic index for the vector database object (and profile it):

In [8]:
my $tstart = now;
$vdbObj.create-semantic-search-index(@paragraphs, method => 'by-max-tokens', max-tokens => 2048, e => $conf):embed;
my $tend = now;
say "Time to make the semantic search index: {$tend - $tstart} seconds.";

Time to make the semantic search index: 235.6492823 seconds.


By default, the vector database object is exported to a sub-directory of [`$XDG_DATA_HOME`](https://specifications.freedesktop.org/basedir-spec/latest/index.html):

In [9]:
# The sub-directory
my $dirname = data-home.Str ~ '/raku/LLM/SemanticSearchIndex';

# The exported vector database base file name
my $basename = "SemSe-{$vdbObj.id}.json";

# Corresponding IO::Path object
my $file = IO::Path.new(:$dirname, :$basename);

# Check for existence
$file.f

True

The export path is saved in the vector database object:

In [10]:
$file.Str eq $vdbObj.location

True
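The same path can also be assembled with core Raku only. The sketch below assumes the XDG default of `$HOME/.local/share` when `XDG_DATA_HOME` is unset or empty, and reuses the id printed earlier. Whether the file exists depends on the machine, so no output is shown for the last line.

```raku
# Fallback defined by the XDG Base Directory spec: $HOME/.local/share
my $data-home = (%*ENV<XDG_DATA_HOME> || "{$*HOME}/.local/share").IO;

# The id printed for the vector database object earlier in this notebook
my $id = '266b20ca-d917-4ac0-9b0a-7c420625666c';

# IO::Path.add appends path parts using the platform's separator
my $file = $data-home.add('raku/LLM/SemanticSearchIndex').add("SemSe-{$id}.json");

say $file.basename;   # SemSe-266b20ca-d917-4ac0-9b0a-7c420625666c.json
say $file.e ?? 'found' !! 'not found';
```

Using `.add` avoids hard-coding `/` between the directory and the file name.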

Show a sample of the text chunks:

In [11]:
#% html
$vdbObj.text-chunks.pairs.pick(4).sort(*.key) ==> to-html()

| Key | Text chunk |
|-----|------------|
| 290 | of abundance of women and the guest then goes, ah. So Joe doesn't make a conversation feel like an interview because he answers statements with statements. If you actually listen, a lot of the time, Joe doesn't ask that many questions in his podcast. He's not a big question asker when compared with most other podcasters. He makes statements. |
| 382 | And I have an enormous number of gay friends. It's not some of my best friends are gay. It's like way too many of them are gay. So I spent a lot of time in, in gay space. And what I've learned from that is that you can go about 85% of the distance talking about relationships, sex in the abstract, hopes, dreams for the future, attraction. And then the last 15% is really different. And I don't want to be in your business at all. And it's constructed that way because we freak each other out. We don't really want the specifics of the details beyond a certain point. And I think that that last 15% can't be shared between straights and gays. We can go 85% |
| 420 | And the Kennedy Shanahan ticket is sophisticated in realizing that campaigning could be something different. It's trying to figure out what should campaigning be. But it's crazy to be an all day session trying to figure out how to save the labor market from AI. And I also want to want to say something about JD Vance without naming names. And I hope JD doesn't get angry at me for this one. JD invited me out years ago to Ohio |
| 430 | Nicole Shanahan and Bobby Kennedy are 100% sincere no matter how they're campaigning or what you're upset about in their off moments. And I've been with all of them. These people deeply care about the shit out of luck. They're, they're interested in taking on real power. I don't know Trump. I mean, look, you can tell it's not, there's no allegiance. I've, I can't imagine voting for Trump. |


Show dimensions and data type of the obtained vectors:

In [12]:
say "dimensions : ", $vdbObj.database.&dimensions;
say "data type  : ", deduce-type($vdbObj.database)

dimensions : (441 1536)
data type  : Assoc(Atom((Str)), Vector(Atom((Numeric)), 1536), 441)
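The database is thus a map from chunk ids (strings) to numeric vectors. As a rough illustration of how such a structure can serve semantic search, the sketch below ranks toy 3-dimensional vectors by cosine similarity to a query vector. It is written in core Raku with made-up numbers; `cosine-similarity`, `%database`, and `@query` are illustrative names, and this is not the retrieval code of `LLM::RetrievalAugmentedGeneration` or `Math::Nearest`.

```raku
# Cosine similarity between two numeric vectors of equal length.
sub cosine-similarity(@a, @b) {
    my $dot    = (@a Z* @b).sum;
    my $norm-a = @a.map(* ** 2).sum.sqrt;
    my $norm-b = @b.map(* ** 2).sum.sqrt;
    $dot / ($norm-a * $norm-b)
}

# Toy "embeddings" keyed by chunk id (made-up numbers, 3 dimensions
# instead of the 1536 reported above).
my %database =
    '0' => [1,   0,   0],
    '1' => [0,   1,   0],
    '2' => [0.6, 0.8, 0];

# Made-up query vector; in practice it would be the embedding of a question.
my @query = 0.9, 0.1, 0;

# Rank chunk ids by similarity to the query, most similar first.
my @ranked = %database.sort({ -cosine-similarity(.value, @query) }).map(*.key);

say @ranked;   # [0 2 1]
```

A linear scan like this is fine for a few hundred vectors; for larger collections a nearest-neighbor structure would likely be a better fit.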


-------

## References

### Articles

[AA1] Anton Antonov, 
["Outlier detection in a list of numbers"](https://rakuforprediction.wordpress.com/2022/05/29/outlier-detection-in-a-list-of-numbers/),
(2022),
[RakuForPrediction at WordPress](https://rakuforprediction.wordpress.com).

### Packages

[AAp1] Anton Antonov,
[WWW::OpenAI Raku package](https://github.com/antononcube/Raku-WWW-OpenAI),
(2023),
[GitHub/antononcube](https://github.com/antononcube).

[AAp2] Anton Antonov,
[WWW::PaLM Raku package](https://github.com/antononcube/Raku-WWW-PaLM),
(2023),
[GitHub/antononcube](https://github.com/antononcube).

[AAp3] Anton Antonov,
[LLM::Functions Raku package](https://github.com/antononcube/Raku-LLM-Functions),
(2023-2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp4] Anton Antonov,
[LLM::Prompts Raku package](https://github.com/antononcube/Raku-LLM-Prompts),
(2023-2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp5] Anton Antonov,
[ML::FindTextualAnswer Raku package](https://github.com/antononcube/Raku-ML-FindTextualAnswer),
(2023-2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp6] Anton Antonov,
[Math::Nearest Raku package](https://github.com/antononcube/Raku-Math-Nearest),
(2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp7] Anton Antonov,
[Math::DistanceFunctions Raku package](https://github.com/antononcube/Raku-Math-DistanceFunctions),
(2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp8] Anton Antonov,
[Statistics::OutlierIdentifiers Raku package](https://github.com/antononcube/Raku-Statistics-OutlierIdentifiers),
(2022),
[GitHub/antononcube](https://github.com/antononcube).

### Videos

[CWv1] Chris Williamson,
["Eric Weinstein - Are We On The Brink Of A Revolution? (4K)"](https://www.youtube.com/watch?v=PYRYXhU4kxM),
(2024),
[YouTube/@ChrisWillx](https://www.youtube.com/@ChrisWillx).   
([transcript](https://podscripts.co/podcasts/modern-wisdom/833-eric-weinstein-are-we-on-the-brink-of-a-revolution).)