# Raku RAG demo

### ... and *Semantic Nearest Neighbors Graphs*

Anton Antonov   
[RakuForPrediction at WordPress](https://rakuforprediction.wordpress.com)   
[RakuForPrediction-book at GitHub](https://github.com/antononcube/RakuForPrediction-book)      
September 2024 


-----

## Introduction

The so-called ***Retrieval Augmented Generation (RAG)*** functionalities of Large Language Models (LLMs) are demonstrated using semantic nearest neighbors graphs.

Vector Databases (VDBs) made with the Raku package ["LLM::RetrievalAugmentedGeneration"](https://raku.land/zef:antononcube/LLM::RetrievalAugmentedGeneration) are used.

1. Semantic graph for words 
    - Select a set of words
    - Find LLM embeddings for those words
    - Find nearest neighbor graph for the obtained vectors

2. Nearest neighbor graph -- joined VDBs
    - Ingest VDBs corresponding to (long) podcast interviews
        - "Modern Wisdom" with [Eric Weinstein](https://en.wikipedia.org/wiki/Eric_Weinstein), [CWv1, CWv2]
        - Eclectic and long interviews (≈3.5 hours each)
        - Heterogeneous content (wide variety of topics)
        - *(Hence, suitable for semantic similarities demos)*
    - Merge VDBs
    - Make a corresponding semantic graph
        - Nearest neighbor graph
        - Connected components

3. Nearest neighbor graph and RAG 
    - Create a semantic relationship graph for an interview
    - LLM-derivation of summaries for top connected components
    - Graph plot traces 
        - Using suitable graph tooltips
    - Do a retrieval-based LLM summary for a query

-----

## Setup

In [2]:
use Data::Importers;
use LLM::Functions;
use XDG::BaseDirectory :terms;

use LLM::RetrievalAugmentedGeneration;
use LLM::RetrievalAugmentedGeneration::VectorDatabase;

use Data::Reshapers;
use Data::Summarizers;
use Math::Nearest;
use Math::DistanceFunctions::Native;
use Statistics::OutlierIdentifiers;

use NativeCall;

use Graph;
use JavaScript::D3;

### JavaScript

Here we prepare the notebook to visualize with JavaScript:

In [None]:
#% javascript
require.config({
     paths: {
     d3: 'https://d3js.org/d3.v7.min'
}});

require(['d3'], function(d3) {
     console.log(d3);
});

Verification:

In [None]:
#% js
js-d3-list-line-plot(10.rand xx 40, background => 'none', stroke-width => 2)

Here we set a collection of visualization variables:

In [None]:
my $title-color = 'Ivory';
my $stroke-color = 'SlateGray';
my $tooltip-color = 'LightBlue';
my $tooltip-background-color = 'none';
my $background = '1F1F1F';
my $color-scheme = 'schemeTableau10';
my $edge-thickness = 3;
my $vertex-size = 6;
my $mmd-theme = q:to/END/;
%%{
  init: {
    'theme': 'forest',
    'themeVariables': {
      'lineColor': 'Ivory'
    }
  }
}%%
END
my %force = collision => {iterations => 0, radius => 10},link => {distance => 180};
my %force2 = charge => {strength => -30, iterations => 4}, collision => {radius => 50, iterations => 4}, link => {distance => 30};

-------

## Words semantic graph

```mermaid
flowchart LR
    WS[Words set selection] --> WE[Word embeddings]
    WE --> WNN
    WNN[Nearest neighbor graph<br>one neighbor] --> WCC[Connected component<br>to focus on]
    WCC --> WNN2[Nearest neighbor graph<br>two neighbors]
    WNN2 --> WGP[Graph plot]
    WE <-.-> LLMs
    
    subgraph WGraph[Words Graph]
        WS
        WE
        WNN
        WNN2
        WCC
        WGP
    end

    subgraph LLMs
        OpenAI{{OpenAI}}
        Gemini{{Gemini}}
        MistralAI{{MistralAI}}
        LLaMA{{LLaMA}}
    end
```

-------

## Words graph making

Here is a set of words:

In [29]:
my @content = <angel apple ardvark bible car cat cherry plum chocolate cookie cow devil film horse house movie projector raccoon tiger tree>;

@content.elems

20

Here we specify an LLM-access configuration:

In [None]:
my $conf = llm-configuration('Gemini');
$conf.Hash.elems

Here we create a semantic search index:

In [None]:
my $vdbObjSmall = create-semantic-search-index(@content, e => $conf, name => 'words')

Here we see the dimensions of the obtained vectors:

In [32]:
$vdbObjSmall.vectors».elems

{00.0 => 768, 01.0 => 768, 02.0 => 768, 03.0 => 768, 04.0 => 768, 05.0 => 768, 06.0 => 768, 07.0 => 768, 08.0 => 768, 09.0 => 768, 10.0 => 768, 11.0 => 768, 12.0 => 768, 13.0 => 768, 14.0 => 768, 15.0 => 768, 16.0 => 768, 17.0 => 768, 18.0 => 768, 19.0 => 768}

Here we find the embedding of a certain word (using the same LLM configuration as above):

In [33]:
my $vec = llm-embedding("coffee", e => $conf).head;
$vec.elems

768

Here we find the three nearest neighbors (NNs) of that word:

In [34]:
my @nns = $vdbObjSmall.nearest($vec, 3, prop => 'label' ).map(*.Slip)

[08.0 05.0 09.0]

Here are the corresponding words:

In [35]:
$vdbObjSmall.items{@nns}

(chocolate cat cookie)
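
The lookup above can be pictured as a linear scan over all stored vectors. The following is a minimal sketch in plain Raku, assuming toy 2D vectors and a hypothetical helper `nearest-by-scan`; it illustrates the idea only and is not how the package implements `nearest`.

```raku
# Toy 2D "embeddings" (assumed data); real embeddings have hundreds of dimensions.
my %vectors =
    apple  => [1.0, 0.8],
    cherry => [0.9, 1.0],
    car    => [-1.0, 0.2],
    house  => [-0.8, -0.5];

# Euclidean distance between two equal-length numeric lists.
sub euclid(@a, @b) { (@a Z- @b).map(* ** 2).sum.sqrt }

# Hypothetical helper: sort all labels by distance to the query, keep the closest $k.
sub nearest-by-scan(%vecs, @query, Int $k) {
    %vecs.keys.sort({ euclid(%vecs{$_}, @query) }).head($k).List
}

say nearest-by-scan(%vectors, [1.0, 1.0], 2);    # (cherry apple)
```

A scan like this costs one distance computation per stored vector, which is acceptable for databases with a few hundred items such as the ones used here.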

Here we find the corresponding nearest neighbor graphs, with 1 and 2 NNs per vertex:

In [36]:
my ($gr1, $gr2) = [1, 2].map({ 
        # NNs graph edges
        my @edges = nearest-neighbor-graph(
            $vdbObjSmall.vectors.pairs, 
            $_, 
            method => 'Scan', 
            distance-function => &euclidean-distance, 
            format => 'dataset'
        );

        # Replace IDs with names
        @edges .= map({ $_<from> = $vdbObjSmall.items{$_<from>}; $_<to> = $vdbObjSmall.items{$_<to>}; $_ });
        
        # Make the graph
        Graph.new(@edges)
}).flat;

.say for ($gr1, $gr2)


Graph(vertexes => 20, edges => 14, directed => False)
Graph(vertexes => 20, edges => 27, directed => False)
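
The 1-NN graph has fewer edges (14) than vertices (20), presumably because two words that are each other's nearest neighbor share a single undirected edge. Here is a small sketch of that effect in plain Raku, assuming toy 1D points and a hypothetical helper `nn-edges`; it does not reproduce the internals of `nearest-neighbor-graph`.

```raku
# Toy 1D points (assumed data) standing in for embedding vectors.
my %points = a => 0.0, b => 1.0, c => 3.0, d => 3.5, e => 10.0;

# Hypothetical helper: link each vertex to its $k closest other vertices.
sub nn-edges(%pts, Int $k) {
    gather for %pts.keys.sort -> $from {
        my @others = %pts.keys.grep(* ne $from);
        take Pair.new($from, $_)
            for @others.sort({ abs(%pts{$_} - %pts{$from}) }).head($k);
    }
}

my @directed = nn-edges(%points, 1);

# In an undirected graph the mutual pairs (a => b and b => a) count once.
my @undirected = @directed.map({ ($_.key, $_.value).sort.join('-') }).unique;

say @directed.elems;    # 5
say @undirected;        # [a-b c-d d-e]
```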


Find the connected components of the 1-NN graph:

In [37]:
my @comps = $gr1.connected-components.sort(-*.elems)

[(movie projector bible film apple) (cow horse tree house car) (plum cookie chocolate cherry) (devil angel) (raccoon ardvark) (tiger cat)]
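
For readers unfamiliar with connected components, the idea can be sketched with a breadth-first search over an adjacency map. The edges below are made up for illustration (they are not the actual edges of `$gr1`), and the code is a plain-Raku sketch rather than the algorithm used by the `Graph` package.

```raku
# Hypothetical undirected edges between a few of the words.
my @edges = 'devil' => 'angel', 'tiger' => 'cat',
            'plum' => 'cherry', 'cookie' => 'chocolate', 'cherry' => 'cookie';

# Adjacency map: vertex => hash of its neighbors.
my %adj;
for @edges -> $e {
    %adj{$e.key}{$e.value} = True;
    %adj{$e.value}{$e.key} = True;
}

# Breadth-first search from every vertex that has not been visited yet.
my %seen;
my @components;
for %adj.keys.sort -> $start {
    next if %seen{$start};
    my @queue = $start;
    my @comp;
    %seen{$start} = True;
    while @queue {
        my $v = @queue.shift;
        @comp.push($v);
        for %adj{$v}.keys.sort -> $w {
            next if %seen{$w};
            %seen{$w} = True;
            @queue.push($w);
        }
    }
    @components.push(@comp.sort.List);
}

# Largest components first, as in the cell above.
.say for @components.sort(-*.elems);
```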

In [39]:
#%js

$gr2.edges(:dataset)
==> js-d3-graph-plot(
        :$background,
        highlight => [|@comps.head, |$gr1.subgraph(@comps.head).edges],
        width => 600,
        vertex-label-color => 'Ivory',
        edge-thickness => 2,
        vertex-size => 3,
        vertex-color => 'Blue',
        edge-color => 'SteelBlue',
        force => { charge => {strength => -200, iterations => 4}, collision => {iterations => 1, radius => 10} }
    )

------

## Semantic nearest neighbors graph

```mermaid
flowchart LR
    subgraph LocalVDB[Local Folder]
        direction LR
        A(Vector Database 1)
        B(Vector Database 2)
        C(Vector Database N)
    end

    subgraph Creation
        ID
        SD
        EV
        CD
    end

    subgraph NNGraph[Semantic Graph]
        D 
        E
        CC
        T 
        GP   
    end

    ID[Ingest document collection]
    SD[Split Documents]
    EV[Get LLM Embedding Vectors]
    CD[Create Vector Database]
    ID --> SD --> EV --> CD

    EV <-.-> LLMs
    
    CD -.- CArray[[CArray<br>representation]]

    CD -.-> |export| LocalVDB

    LocalVDB -.- JSON[[JSON<br>representation]]

    LocalVDB -.-> |import|D[Ingest Vector Database]
 
    D -.- CArray
    D --> E
    E[Nearest neighbor graph] --> CC[Connected Components]
    E -.- |nearest neighbors<br>distance function|CArray
    CC --> T[LLM-derived titles<br>per component]
    T --> GP[Graph plot]

    subgraph LLMs
        OpenAI{{OpenAI}}
        Gemini{{Gemini}}
        MistralAI{{MistralAI}}
        LLaMA{{LLaMA}}
    end
```

------

## Ingest vector databases

Find the vector databases stored in the default directory and show a summary:

In [40]:
#% html
my @field-names = <id name item-count dimension version llm-service llm-embedding-model created>;

vector-database-objects(f=>'hash', :flat)

==> { $_.map({ $_<created> = $_<file>.IO.created.DateTime.Str.subst('T',' ').substr(^19); $_}).sort(*<created>).reverse }()
==> to-html(:@field-names)

| id | name | item-count | dimension | version | llm-service | llm-embedding-model | created |
|---|---|---|---|---|---|---|---|
| 5097c865-7fad-43d9-b7ca-456220d754b7 | words | 20 | 768 | 0 | gemini | embedding-001 | 2024-09-18 14:08:44 |
| 045f467c-193f-4df6-bec3-790d6c83ca64 | No833 | 441 | 768 | 0 | gemini | embedding-001 | 2024-09-16 22:21:25 |
| d2effebc-2cef-4b2b-84ca-5dcfa3c1864b | No747 | 283 | 1536 | 0 | chatgpt | text-embedding-3-small | 2024-09-12 13:32:30 |
| 5cb40fbb-9f69-48ca-9fc1-03ec8059ed99 | No747 | 283 | 768 | 0 | gemini | embedding-001 | 2024-09-12 13:32:23 |
| 44f19858-730e-4b96-86b7-81e701f9df8f | No747 | 283 | 1536 | 0 | chatgpt | text-embedding-3-small | 2024-09-12 13:32:18 |
| 266b20ca-d917-4ac0-9b0a-7c420625666c | No833 | 441 | 1536 | 0 | chatgpt | text-embedding-3-small | 2024-09-12 13:32:11 |


Get specific databases:

In [41]:
#my @vdbs = vector-database-objects(f=>'hash', :flat).grep({ $_<name> ∈ <No833 No747> && $_<llm-service> eq 'gemini' }).map({ create-vector-database(file => $_<file>) })

my @vdbs = vector-database-objects(f=>'hash', :flat).grep({ $_<id> ∈ <266b20ca-d917-4ac0-9b0a-7c420625666c d2effebc-2cef-4b2b-84ca-5dcfa3c1864b> }).map({ create-vector-database(file => $_<file>) })

[VectorDatabase(:id("266b20ca-d917-4ac0-9b0a-7c420625666c"), :name("No833"), :elements(441), :sources(442)) VectorDatabase(:id("d2effebc-2cef-4b2b-84ca-5dcfa3c1864b"), :name("No747"), :elements(283), :sources(284))]

Sample vectors:

In [42]:
.say for @vdbs.head.vectors.pick(3)

306.0 => NativeCall::Types::CArray[num64].new
231.0 => NativeCall::Types::CArray[num64].new
167.0 => NativeCall::Types::CArray[num64].new
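
The vectors print as `NativeCall::Types::CArray[num64].new` because they are kept as C-style arrays of 64-bit floats. A brief sketch of that representation using the core `NativeCall` module follows; the values are made up, and how the package itself builds its arrays is not shown here.

```raku
use NativeCall;

# A plain Raku array of floating-point numbers (assumed toy values) ...
my @plain = 0.25e0, -1.5e0, 3e0;

# ... packed into a C-style array of 64-bit floats.
my $packed = CArray[num64].new(@plain);

say $packed.elems;    # number of stored values
say $packed.list;     # back to a regular Raku list
```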


Sample items:

In [43]:
#% html
@vdbs.head.items.pick(3).map({ <key value> Z=> $_.kv })».Hash.Array 
==> to-html(field-names => <key value>, align => 'left')

key,value
292.0,"I've been trying to cultivate that because I think it makes for such a beautiful conversational flow. One of the problems that you have is as you start to push the guests' expertise, the delta between yours and the guests' expertise, your ability to answer statements with statements becomes lower. You need to say, what do you mean by that? Or how's that the case? Or what would you say is this thing?"
154.0,"around trying to destroy her. Yep. And the same was true. It's also, it's magnified, I think, not just by the descent, but also by the platform, same one exposure. People got jealous of exposure. And I don't think it's that. Oh, I think that it is very, very obvious that if somebody gets attention"
17.0,"institutions that sprang from democracy once upon a time, and that those institutions have to be kept strong. Those are two completely different concepts that are overloaded to the same word. Under that circumstance, we have a paradox, which is how do we keep the electorate from overturning the type A democracy from overturning the type B democracy? And that's the unsolved problem that they will"


-----

## Combined databases graph

Here we join the selected databases:

In [44]:
my $vdbObj2 = vector-database-join(@vdbs)

VectorDatabase(:id("dc9c6550-a84a-4332-bc1b-771f96da282d"), :name("No833-AND-No747"), :elements(724), :sources(726))

Find the nearest neighbor graph of the joined database:

In [45]:
my @edges2 = nearest-neighbor-graph($vdbObj2.vectors.pairs, 1, method => 'Scan', distance-function => &euclidean-distance, format => 'raku');

my $gr2 = Graph.new(@edges2)

Graph(vertexes => 724, edges => 565, directed => False)

In [46]:
my @comps2 = $gr2.connected-components.sort(-*.elems).head(8);

@comps2.elems

8

In [47]:
#% js
@edges2
==> js-d3-graph-plot(
        :$background,
        highlight => @comps2.map({[|$_, |$gr2.subgraph($_).edges]}),
        vertex-label-color => 'none',
        edge-thickness => 2,
        vertex-size => 4,
        width => 1100,
        height => 950,
        edge-color => 'Gray',
        vertex-color => 'DarkGray',
        #force => {charge => {strength => -3, iterations => 4}, collision => {radius => 14, iterations => 4}, link => {distance => 1};}
    )

-----

## Nearest neighbor graph

Take one of the vector databases:

In [48]:
my $vdbObj = @vdbs.head

VectorDatabase(:id("266b20ca-d917-4ac0-9b0a-7c420625666c"), :name("No833"), :elements(441), :sources(442))

In [49]:
.say for $vdbObj.items.pick(3)

011.0 => Okay. We only have one candidate that's acceptable to the international order. Donald Trump will be under, um, constant pressure that he's a loser. He's a wild man. He's an idiot. And, and he's under the control of the Russians. And then he was going to be, you know, a 20 to one underdog.
098.0 => Okay. Illustrious. And he was a shitty, shitty physics student. And he said, you know what, I'm going to use the fact that I'm a shitty physics student, like below average at, at, you know, Princeton's like one of the greatest physics departments of all time. And he said, I'm going to approach Freeman Dyson at the Institute for Advanced Study and see whether I can work out how to make a fission bomb that would actually work.
155.0 => and someone else feels that it's undeserving in one form or another, that guy's a phony and look at all of the, whatever they get. I think there's some of that, but I think to think that that's what it is, is mistaken. Not entirely, but I think that it's

Make the nearest neighbor graph for the vectors in the vector database:

In [50]:
my @edges = nearest-neighbor-graph($vdbObj.vectors.pairs, 1, method => 'Scan', distance-function => &euclidean-distance, format => 'raku')

[298.0 => 299.0 195.0 => 098.0 051.0 => 053.0 433.0 => 438.0 021.0 => 436.0 092.0 => 026.0 310.0 => 311.0 426.0 => 425.0 323.0 => 324.0 336.0 => 266.0 422.0 => 428.0 333.0 => 332.0 240.0 => 239.0 200.0 => 216.0 217.0 => 179.0 174.0 => 216.0 165.0 => 166.0 224.0 => 223.0 127.0 => 126.0 212.0 => 216.0 082.0 => 165.0 283.0 => 282.0 263.0 => 261.0 104.0 => 096.0 415.0 => 001.0 434.0 => 268.0 196.0 => 437.0 314.0 => 315.0 089.0 => 090.0 042.0 => 041.0 077.0 => 078.0 372.0 => 249.0 302.0 => 289.0 329.0 => 044.0 185.0 => 184.0 353.0 => 176.0 397.0 => 399.0 111.0 => 112.0 182.0 => 181.0 172.0 => 179.0 432.0 => 434.0 313.0 => 314.0 164.0 => 199.0 001.0 => 415.0 046.0 => 088.0 325.0 => 339.0 277.0 => 278.0 403.0 => 392.0 413.0 => 430.0 248.0 => 249.0 169.0 => 174.0 304.0 => 300.0 294.0 => 113.0 114.0 => 115.0 266.0 => 261.0 362.0 => 363.0 261.0 => 266.0 079.0 => 078.0 047.0 => 217.0 342.0 => 341.0 184.0 => 185.0 317.0 => 314.0 124.0 => 160.0 206.0 => 216.0 430.0 => 417.0 251.0 => 250.0 035.0 => 

Make the graph:

In [51]:
my $gr = Graph.new(@edges)

Graph(vertexes => 441, edges => 346, directed => False)

Get the graph's connected components:

In [52]:
my @comps = $gr.connected-components.sort(-*.elems);
.say for @comps.head(12)

(335.0 208.0 209.0 210.0 303.0 305.0 293.0 296.0 287.0 286.0 385.0 304.0 300.0 298.0 299.0 388.0 302.0 292.0 291.0 290.0 277.0 278.0 289.0 288.0 285.0 207.0)
(419.0 022.0 023.0 036.0 393.0 420.0 418.0 403.0 048.0 392.0 391.0 417.0 430.0 429.0 431.0 413.0 019.0)
(387.0 269.0 214.0 213.0 194.0 206.0 212.0 201.0 200.0 216.0 154.0 169.0 174.0 376.0 386.0 134.0)
(229.0 171.0 068.0 331.0 170.0 172.0 153.0 191.0 177.0 183.0 179.0 349.0 258.0 217.0 047.0)
(141.0 122.0 186.0 137.0 127.0 158.0 163.0 162.0 139.0 126.0 241.0 076.0 050.0)
(124.0 136.0 135.0 131.0 128.0 125.0 161.0 138.0 198.0 199.0 164.0 160.0 107.0)
(274.0 389.0 132.0 071.0 070.0 366.0 416.0 415.0 441.0 002.0 003.0 001.0)
(411.0 097.0 195.0 098.0 101.0 103.0 099.0 116.0 105.0 104.0 096.0)
(021.0 436.0 265.0 326.0 295.0 262.0 297.0 264.0 435.0 020.0)
(402.0 406.0 404.0 405.0 218.0 215.0 275.0 221.0 412.0 028.0)
(338.0 337.0 321.0 340.0 348.0 334.0 325.0 339.0 284.0)
(336.0 232.0 266.0 365.0 263.0 261.0 437.0 196.0)


Example paragraph:

In [53]:
#% markdown

$vdbObj.items<170.0>

into a gambit. So you try not answering the criticism and then it becomes, why won't he answer his critics? And then you're saying, well, are you applying this criticism uniformly? Are you... criticism, um, uniformly. Are you,

LLM function for naming a set of paragraphs:

In [54]:
my &fNamer = llm-function({"Summarize the text into a very short sentence that has at most 8 words: \n\n $_"})

-> **@args, *%args { #`(Block|2985799835096) ... }

Example title finding:

In [55]:
@comps[5]

(124.0 136.0 135.0 131.0 128.0 125.0 161.0 138.0 198.0 199.0 164.0 160.0 107.0)

In [56]:
&fNamer( $vdbObj.items{|@comps[5]}.join("\n") )

String theorists mistreat others, lack real achievements in physics.

Find titles for some of the largest components:

In [58]:
my @titles = @comps.head(6).map({ &fNamer( $vdbObj.items{|$_}.join("\n") ) });

@titles.elems

6

In [59]:
.say for @titles

Joe Rogan is a great interviewer who makes guests shine.
Shanahan and Kennedy offer alternative to Trump and Harris.
Criticism inaccuracy magnified on the internet, causing harm.
Criticism online is mostly from stalkers, not critics.
String theory critics challenge its validity and relevance.
String theorists are criticized for lack of achievements.


Make rules that map each paragraph ID in the titled components to an "ID : title" label:

In [60]:
my @res = do for ^@titles.elems -> $i {
    my @vals = @comps[$i].Array X~ ' : ' X~  @titles[$i];
    @comps[$i].Array Z=> @vals
}

my %rules = @res.map(*.Slip);

%rules.elems


100
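
The rule-making cell relies on two metaoperators: `X~` (cross-concatenate) and `Z=>` (zip into pairs). Here is a small sketch of the same pattern on made-up IDs and titles, mirroring the structure of the cell above:

```raku
# Hypothetical toy data: two components and one title per component.
my @comps  = <p01 p02>, <p07 p08 p09>;
my @titles = 'Topic A', 'Topic B';

my @res = do for ^@titles.elems -> $i {
    # X~ joins every ID with ' : ' and with the component's title.
    my @vals = @comps[$i].Array X~ ' : ' X~ @titles[$i];
    # Z=> pairs each ID with its label.
    @comps[$i].Array Z=> @vals
}

# Flatten the per-component pair lists into one hash.
my %rules = @res.map(*.Slip);

say %rules.elems;    # 5
say %rules<p08>;     # p08 : Topic B
```

The number of rules equals the total number of vertices in the titled components, which is why the cell above returns 100 for the six largest components.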

Make graph highlight specification:

In [61]:
my @colors = <#1f77b4 #ff7f0e #2ca02c #d62728 #9467bd #8c564b #e377c2 #7f7f7f #bcbd22 #17becf>;

my $highlight = (@colors.head(6).Array Z=> @comps.head(6).map({ [ |$_.map({ %rules{$_} // $_ }), |$gr.subgraph($_).edges.map({ ( %rules{$_.key} // $vdbObj.items{$_.key} ) => ( %rules{$_.value} // $vdbObj.items{$_.value} ) }) ]})).Hash;

{#1f77b4 => [335.0 : Joe Rogan is a great interviewer who makes guests shine. 208.0 : Joe Rogan is a great interviewer who makes guests shine. 209.0 : Joe Rogan is a great interviewer who makes guests shine. 210.0 : Joe Rogan is a great interviewer who makes guests shine. 303.0 : Joe Rogan is a great interviewer who makes guests shine. 305.0 : Joe Rogan is a great interviewer who makes guests shine. 293.0 : Joe Rogan is a great interviewer who makes guests shine. 296.0 : Joe Rogan is a great interviewer who makes guests shine. 287.0 : Joe Rogan is a great interviewer who makes guests shine. 286.0 : Joe Rogan is a great interviewer who makes guests shine. 385.0 : Joe Rogan is a great interviewer who makes guests shine. 304.0 : Joe Rogan is a great interviewer who makes guests shine. 300.0 : Joe Rogan is a great interviewer who makes guests shine. 298.0 : Joe Rogan is a great interviewer who makes guests shine. 299.0 : Joe Rogan is a great interviewer who makes guests shine. 388.0 : Joe 

Semantic graph:

In [62]:
#%js
my @edges2 = @edges.map({ ( %rules{$_.key} // $vdbObj.items{$_.key} ) => ( %rules{$_.value} // $vdbObj.items{$_.value} ) });

@edges2
==> js-d3-graph-plot(
        :$background,
        :$highlight,
        vertex-label-color => 'none',
        edge-thickness => 2,
        vertex-size => 4,
        width => 1100,
        height => 700,
        edge-color => 'Gray',
        vertex-color => 'DarkGray',
    )

Generate an LLM summary based on one of the components:

In [63]:
#% markdown

llm-synthesize([ 
    "Summarize the following text into a list of three-four points, each with no more than 8 words:",
    $vdbObj.items{|@comps[5].sort}.join(" "),
], e => $conf)

- Foreign students study irrelevant theories.
- String theorists suppress competing theories.
- Lawrence Krauss and Leonard Susskind criticized for their behavior.
- Peter Wojt and Sean Carroll debate the value of string theory.

-------

## References

### Articles

[AA1] Anton Antonov, 
["Outlier detection in a list of numbers"](https://rakuforprediction.wordpress.com/2022/05/29/outlier-detection-in-a-list-of-numbers/),
(2022),
[RakuForPrediction at WordPress](https://rakuforprediction.wordpress.com).

### Packages

[AAp1] Anton Antonov,
[WWW::OpenAI Raku package](https://github.com/antononcube/Raku-WWW-OpenAI),
(2023),
[GitHub/antononcube](https://github.com/antononcube).

[AAp2] Anton Antonov,
[WWW::PaLM Raku package](https://github.com/antononcube/Raku-WWW-PaLM),
(2023),
[GitHub/antononcube](https://github.com/antononcube).

[AAp3] Anton Antonov,
[LLM::Functions Raku package](https://github.com/antononcube/Raku-LLM-Functions),
(2023-2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp4] Anton Antonov,
[LLM::Prompts Raku package](https://github.com/antononcube/Raku-LLM-Prompts),
(2023-2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp5] Anton Antonov,
[ML::FindTextualAnswer Raku package](https://github.com/antononcube/Raku-ML-FindTextualAnswer),
(2023-2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp6] Anton Antonov,
[Math::Nearest Raku package](https://github.com/antononcube/Raku-Math-Nearest),
(2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp7] Anton Antonov,
[Math::DistanceFunctions Raku package](https://github.com/antononcube/Raku-Math-DistanceFunctions),
(2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp8] Anton Antonov,
[Statistics::OutlierIdentifiers Raku package](https://github.com/antononcube/Raku-Statistics-OutlierIdentifiers),
(2022),
[GitHub/antononcube](https://github.com/antononcube).

### Videos

[CWv1] Chris Williamson,
["Eric Weinstein - Why Does The Modern World Make No Sense? (4K)"](https://www.youtube.com/watch?v=p_swB_KS8Hw),
(2024),
[YouTube/@ChrisWillx](https://www.youtube.com/@ChrisWillx).   
([transcript](https://podscripts.co/podcasts/modern-wisdom/747-eric-weinstein-why-does-the-modern-world-make-no-sense).)

[CWv2] Chris Williamson,
["Eric Weinstein - Are We On The Brink Of A Revolution? (4K)"](https://www.youtube.com/watch?v=PYRYXhU4kxM),
(2024),
[YouTube/@ChrisWillx](https://www.youtube.com/@ChrisWillx).   
([transcript](https://podscripts.co/podcasts/modern-wisdom/833-eric-weinstein-are-we-on-the-brink-of-a-revolution).)