# Question Answering System by Retrieval Augmented Generation

### *Guide*

Anton Antonov    
September 2024  

-----

## Introduction

This notebook shows how to import a vector database of precomputed LLM embeddings and then use an LLM to generate answers to queries from it.

-------

## Setup

Packages used below:

In [19]:
use Data::Importers;
use LLM::Functions;
use XDG::BaseDirectory :terms;

use LLM::RetrievalAugmentedGeneration;
use LLM::RetrievalAugmentedGeneration::VectorDatabase;

use Data::Reshapers;
use Data::Summarizers;
use Math::Nearest;
use Math::DistanceFunctions::Native;
use Statistics::OutlierIdentifiers;

use NativeCall;

A special LLM configuration:

In [2]:
my $conf4o = llm-configuration('ChatGPT', model => 'gpt-4o', max-tokens => 4096, temperature => 0.4);
$conf4o.Hash.elems

24

-----

## Import Vector Database

In this section we import the vector database and compute some basic statistics over it.

Here we make an empty vector database object:

In [3]:
my $vdbObj = LLM::RetrievalAugmentedGeneration::VectorDatabase.new();

VectorDatabase(:id("5f0a1194-527a-4000-ad92-08739a4fc8a7"), :name(""), :elements(0), :sources(0))

We can see the gists of the available _pre-computed_ vector databases with `vector-database-objects`.
Here we tabulate the contents of those gists:

In [4]:
#% html
vector-database-objects(format=>'map') ==> to-html(field-names => <id name document-count item-count version>)

| id | name | document-count | item-count | version |
|---|---|---|---|---|
| 44f19858-730e-4b96-86b7-81e701f9df8f | No747 | 284 | 283 | 0 |
| 5cb40fbb-9f69-48ca-9fc1-03ec8059ed99 | No747 | 284 | 283 | 0 |
| 266b20ca-d917-4ac0-9b0a-7c420625666c | No833 | 442 | 441 | 0 |
| d2effebc-2cef-4b2b-84ca-5dcfa3c1864b | No747 | 284 | 283 | 0 |


Here we form a file path for a previously computed (and exported) vector database using [`$XDG_DATA_HOME`](https://specifications.freedesktop.org/basedir-spec/latest/index.html):

In [5]:
# The sub-directory
my $dirname = data-home.Str ~ '/raku/LLM/SemanticSearchIndex';

# The exported vector database base file name
my $basename = 'SemSe-266b20ca-d917-4ac0-9b0a-7c420625666c.json';

# Corresponding IO::Path object
my $file = IO::Path.new(:$dirname, :$basename);

# Check for existence
$file.f

True

Import the vector database:

In [6]:
my $tstart = now;

$vdbObj.import($file);

my $tend = now;

say "Import time { $tend - $tstart } seconds.";

Import time 4.587041379 seconds.


Here is the vector database object's _gist_:

In [7]:
say $vdbObj;

VectorDatabase(:id("266b20ca-d917-4ac0-9b0a-7c420625666c"), :name("No833"), :elements(441), :sources(442))


Show a sample of the text chunks:

In [8]:
#% html
$vdbObj.items.pick(3) ==> to-html()

0,1
242.0,"that we need to do. Another version of this, by the way, is some giant percentage of the population says, I don't understand your argument when they say, when they really mean I don't accept your argument. For example, you could ask me, I don't, you could say, Eric, I don't accept your argument. For example, you could ask me, you could say, Eric, I don't understand antisemitism. Jews do so much, they contribute to society. I would say, I understand antisemitism."

0,1
324.0,"I'm aberrant. It's like, I got it. I really got it. You don't like me. I don't like you. You're bad people to me. You can think that I'm the student who's just disagreeable. But the fact of the matter is life depends on disagreeable people."

0,1
228.0,"hold a pro-life or pro-choice position. The comments are dominated by people saying, of course it's a life, this is a monstrous question. How can you even think about this? I don't believe in murder. So in other words, it's not even the pro-choice people, but the pro-life people who are dominating the comments. And it goes back to Yates with the idea that, uh, the worst are full of passionate intensity and the best lack all conviction. It's not right, but it's the people within a clear ideological position feel very comfortable speaking. And the people who have a nuanced position have learned their lesson to shut"


Show vector database dimensions:

In [10]:
$vdbObj.vectors.&dimensions

(441 1536)

Show vectors' norms:

In [11]:
$vdbObj.vectors.pick(3).deepmap({ norm($_.value) })

(1.0000000395436737 1.0000000222755683 0.9999999932222275)

Here is a summary of the norms of all vectors:

In [12]:
sink records-summary($vdbObj.vectors.values.map({ norm($_) }))

+------------------------------+
| numerical                    |
+------------------------------+
| 3rd-Qu => 1.0000000370665862 |
| Min    => 0.9999999127751142 |
| Mean   => 1.0000000117763133 |
| Max    => 1.000000100169172  |
| 1st-Qu => 0.9999999894227214 |
| Median => 1.000000012338632  |
+------------------------------+
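
Since the norms are all very close to 1, Euclidean distance and cosine similarity rank neighbors in the same order: for unit vectors the squared distance equals `2 - 2 * cosine`. The following is a minimal sketch of that identity in plain Raku. The toy vectors and the helper subs `l2-norm`, `dot`, and `l2-distance` are made up for illustration and are not part of the packages used in this notebook.

```raku
# Toy vectors (hypothetical values, not taken from the database)
my @a = 3, 4, 0;
my @b = 0, 5, 12;

# Euclidean norm, dot product, and Euclidean distance in plain Raku
sub l2-norm(@v)         { sqrt([+] @v »*« @v) }
sub dot(@u, @v)         { [+] @u »*« @v }
sub l2-distance(@u, @v) { l2-norm(@u »-« @v) }

# Scale each vector to unit length
my @ua = @a »/» l2-norm(@a);
my @ub = @b »/» l2-norm(@b);

my $cosine   = dot(@ua, @ub);
my $distance = l2-distance(@ua, @ub);

# For unit vectors the last two values agree up to rounding error
say "cosine similarity: $cosine";
say "squared distance:  { $distance ** 2 }";
say "2 - 2 * cosine:    { 2 - 2 * $cosine }";
```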


------

## Themes found in the text chunks

The vector database object has the attributes:
- `database` which is a `Map` of labels (identifiers) to LLM embedding vectors
- `text-chunks` which is a `Map` of labels to text chunks that correspond to the embedding vectors

(The keys of `database` and `text-chunks` are the same.)


By examining the LLM-extracted themes from the text chunks of the imported vector database,
we see that the discussion they came from is [fairly eclectic](https://www.youtube.com/watch?v=PYRYXhU4kxM), [CWv1]:

In [None]:
#% html
# my $res = llm-synthesize([
#     llm-prompt("ThemeTableJSON")(
#         $vdbObj.items.sort(*.key)».value.join("\n"), 'text', 15, 
#     )
#     ],
#     e => $conf4o,
#     form => sub-parser('JSON'):drop
# );

# $res ==> to-html(field-names => <theme content>, align => 'left')

**Remark:** It is instructive to compare the extracted themes with the list of video segments given in [CWv1].

-----

## Nearest neighbors finding

Here is a query:

In [13]:
my $query = 'What is the state of string theory?';

What is the state of string theory?

Here we find the labels of the vectors (and text chunks) that are _considered_ nearest neighbors of the query:

In [14]:
my @nnLabels = $vdbObj.nearest($query, 10, prop => <label distance>, distance-function => &euclidean-distance);

@nnLabels ==> deduce-type

Vector(Tuple([Atom((Str)), Atom((Numeric))]), 10)

**Remark:** The call above does not specify a `degree` argument; parallel execution can be requested with, for example, `degree => 4`.

Here we make the corresponding dataset that includes the text chunks from the vector database:

In [15]:
my @dsScores = @nnLabels.map({
    %( label => $_[0], distance => $_[1], text => $vdbObj.items{$_[0]} )
});

@dsScores.&dimensions

(10 3)

Here we show the dataset:

In [16]:
#% html
@dsScores ==> to-html(field-names => <distance label text>, align => 'left');

distance,label,text
0.9066815079532048,126.0,"But is somebody at the forefront of string theory? Absolutely. And he said, quote, I can tell you with absolute certainty, string theory is not the theory of the real world. I can tell you that 100%. My strong feelings are exactly that string theory is definitely not the theory of the real world. I can tell you that 100%. My strong feelings are exactly that string theory is definitely not the theory of the real world. Is that taking it out of context? Is that him framing it somewhere else? Or does that encapsulate the fact"
0.9301226226546816,139.0,"And I saw a tweet saying that somebody had been to a string theory convention and had asked the question, what is string theory? And the best string theorists on the planet came up with the answer, we kind of don't know what string theory is. And the other answer is whatever it is that we're doing. Whatever it is that the string theory community is doing. Even if they did something that had nothing to do with string theory,"
0.9664632700051048,121.0,"That is not shiny. That is saying everything else is crap and dangerous. In other words, it's string theory can't sell itself as physics. By any telling of the story, string theory is the most failed theory in the history of physics. If you look at the number of papers, the amount of money, the number of people, the number of PhDs, number of conferences, achievements in physics proper per investment or size of effort. It is the most failed theory in the history of physics and the way in which it survives is by hunting and destroying its enemies and making its"
0.9696340383729332,112.0,"We don't know whether that we're talking about the stagnation of theoretical physics or just nuclear physics. You're okay with speculating. Let's speculate. I'll do the decision tree. One possibility is that they're simply saying that they made nuclear physics very, very difficult to do. And that has to do with not very sexy physics, the physics of protons and neutrons in nuclear. So that branch exists. The other branch says, um, we used string theory to cock block actual progress in theoretical physics and derailed an entire field, at least"
1.008971602214594,162.0,"Everybody in the community reads it and many people pretend that they don't because it's very critical of string theory, but he's very, very good. Then he writes a book like this. Nobody saw it coming. Then he comes up with two theories, both of which I of string theory, but he's very, very good. Then he writes a book like this, nobody saw it coming. Then he comes up with two theories, both of which I think are wrong, but are really, really clever about the nature of the strong force, what would be called weaker hypercharge"
1.0202723899830015,138.0,"and removed with extreme prejudice. It's anti-science. So I don't know much or anything really about the inner workings of string theory, but Sabine Hossenfeld has been on the show, Brian Greene's been on the show, Sean Carroll's been on the show. Oh, let's get them, all of them."
1.027078946717563,127.0,"that he thinks string theory is a dead end that doesn't describe the world? He's playing a game that I would, I would say is Logomachy, an argument over words, where he says that big S string theory is not the theory of the real world, which is the theory that was used to destroy all of its competitors and that little S string theory exists. I don't, this is basically the attempt, uh, to take a school massacre and plead to a parking ticket. And no, I think that the prosecution should decline the offer from the good Dr. Suskin and say, no, no, no, you have 40 years of the destruction of your colleagues to answer for you've chosen to be, um, words, family, an asshole,"
1.0748528004708455,141.0,"No, Mr. Smart, I don't believe that either. Two Cub Scouts with slingshots. So, this is a very old pattern. Yeah. Is this too far gone for string theory now? Is it the mask is beginning to slip to the point where even Ed Dutton's going to have to eat his words within the next decade?"
1.0871593363303194,137.0,"I have my own theory and I'm happy to fight with Peter, but Peter and I have been friends for all these years. Uh, I would love to have Nima Arkani Hamed and Ed Frankel and others, uh, judge this people who aren't really string theorists who appreciate the best parts of string inspired mathematics, let's say, or string inspired mechanisms in physics. There is, the equations are not without interest or merit. It's the, the sociology should be hunted"
1.0941371063568304,125.0,"Suskin being one of the best theoretical physicists ever. No. No. Why is he somebody worth listening to then? Um, he's very, very smart and he's one of the most important string theorists ever, and he writes exceptionally clear and correct introductory books. Okay. But he is not a leading physicist."


-----

## Nearest neighbors finding (low-level)

In this section we show how to find the elements of the vector database that are _considered_ nearest neighbors to the query vector. 
We use "low-level" computations for didactic purposes.
The same or similar results can be obtained by using the method `nearest` of the vector database object.

Here is the query's vector embedding (computed with the same LLM that was used to compute the vector database):

In [23]:
# Get the LLM embedding
my $query-vector = llm-embedding($query, llm-evaluator => $vdbObj.llm-configuration).head;

# Convert to CArray
$query-vector = CArray[num64].new($query-vector);

$query-vector.elems

1536

For each vector in the database find its distance to the query vector:

In [25]:
my @dsScores =
        $vdbObj.vectors.hyper(batch => ceiling($vdbObj.item-count / 4), degree => 4).map({ %(
            label => $_.key,
            distance => euclidean-distance($_.value, $query-vector),
            text => $vdbObj.items{$_.key}
        ) }).Array;

@dsScores.&dimensions

(441 3)

**Remark:** In the computation of the distances we use parallel processing via `hyper`. 

Sort in ascending order:

In [26]:
@dsScores .= sort({ $_<distance> });

@dsScores.map(*<distance>).head(6)

(0.9066815079532047 0.9301226226546815 0.9664632700051049 0.9696340383729332 1.008971602214594 1.0202723899830017)
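
The two steps above (compute all distances, then sort them) amount to a brute-force nearest neighbors search. Here is a minimal sketch of the same idea in plain Raku, without `hyper` or native arrays. The toy vectors, the labels, and the helper sub `l2-distance` are hypothetical.

```raku
# Hypothetical toy "database": labels mapped to small vectors
my %vectors =
    a => [1.0, 0.0, 0.0],
    b => [0.0, 1.0, 0.0],
    c => [0.6, 0.8, 0.0],
    d => [0.0, 0.0, 1.0];

# Hypothetical query vector
my @query = 0.8, 0.6, 0.0;

# Euclidean distance in plain Raku
sub l2-distance(@u, @v) {
    my @diff = @u »-« @v;
    sqrt([+] @diff »*« @diff)
}

# Compute the distance from each database vector to the query
my @scores = %vectors.map({ %( label => $_.key, distance => l2-distance($_.value, @query) ) });

# Sort in ascending order of distance
@scores .= sort({ $_<distance> });

# Show the two nearest neighbors
for @scores.head(2) -> $s {
    say $s<label>, ' => ', $s<distance>.round(0.001);
}
```

For a database of a few hundred vectors a full scan like this is cheap; the parallel version above mainly helps when the number of vectors grows.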

Show the text chunks closest to the query:

In [28]:
#% html
@dsScores.head(8) ==> to-html(field-names => <distance label text>, align => 'left');

distance,label,text
0.9066815079532048,126.0,"But is somebody at the forefront of string theory? Absolutely. And he said, quote, I can tell you with absolute certainty, string theory is not the theory of the real world. I can tell you that 100%. My strong feelings are exactly that string theory is definitely not the theory of the real world. I can tell you that 100%. My strong feelings are exactly that string theory is definitely not the theory of the real world. Is that taking it out of context? Is that him framing it somewhere else? Or does that encapsulate the fact"
0.9301226226546816,139.0,"And I saw a tweet saying that somebody had been to a string theory convention and had asked the question, what is string theory? And the best string theorists on the planet came up with the answer, we kind of don't know what string theory is. And the other answer is whatever it is that we're doing. Whatever it is that the string theory community is doing. Even if they did something that had nothing to do with string theory,"
0.9664632700051048,121.0,"That is not shiny. That is saying everything else is crap and dangerous. In other words, it's string theory can't sell itself as physics. By any telling of the story, string theory is the most failed theory in the history of physics. If you look at the number of papers, the amount of money, the number of people, the number of PhDs, number of conferences, achievements in physics proper per investment or size of effort. It is the most failed theory in the history of physics and the way in which it survives is by hunting and destroying its enemies and making its"
0.9696340383729332,112.0,"We don't know whether that we're talking about the stagnation of theoretical physics or just nuclear physics. You're okay with speculating. Let's speculate. I'll do the decision tree. One possibility is that they're simply saying that they made nuclear physics very, very difficult to do. And that has to do with not very sexy physics, the physics of protons and neutrons in nuclear. So that branch exists. The other branch says, um, we used string theory to cock block actual progress in theoretical physics and derailed an entire field, at least"
1.008971602214594,162.0,"Everybody in the community reads it and many people pretend that they don't because it's very critical of string theory, but he's very, very good. Then he writes a book like this. Nobody saw it coming. Then he comes up with two theories, both of which I of string theory, but he's very, very good. Then he writes a book like this, nobody saw it coming. Then he comes up with two theories, both of which I think are wrong, but are really, really clever about the nature of the strong force, what would be called weaker hypercharge"
1.0202723899830015,138.0,"and removed with extreme prejudice. It's anti-science. So I don't know much or anything really about the inner workings of string theory, but Sabine Hossenfeld has been on the show, Brian Greene's been on the show, Sean Carroll's been on the show. Oh, let's get them, all of them."
1.027078946717563,127.0,"that he thinks string theory is a dead end that doesn't describe the world? He's playing a game that I would, I would say is Logomachy, an argument over words, where he says that big S string theory is not the theory of the real world, which is the theory that was used to destroy all of its competitors and that little S string theory exists. I don't, this is basically the attempt, uh, to take a school massacre and plead to a parking ticket. And no, I think that the prosecution should decline the offer from the good Dr. Suskin and say, no, no, no, you have 40 years of the destruction of your colleagues to answer for you've chosen to be, um, words, family, an asshole,"
1.0748528004708455,141.0,"No, Mr. Smart, I don't believe that either. Two Cub Scouts with slingshots. So, this is a very old pattern. Yeah. Is this too far gone for string theory now? Is it the mask is beginning to slip to the point where even Ed Dutton's going to have to eat his words within the next decade?"


Show the nearest neighbors scores (with the chosen distance function):

In [29]:
text-list-plot(@dsScores.map(*<distance>), width => 100, height => 16)

+----+-------------------+-------------------+-------------------+-------------------+-------------+      
|                                                                                                  |      
+                                                                                         ****     +  1.40
|                                                              ****************************        |      
|                             **********************************                                   |      
+                  ************                                                                    +  1.30
|            *******                                                                               |      
+         ****                                                                                     +  1.20
|       ***                                                                                        |      
+      **                            

From the plot we can see that there are clear outliers. Here we find the outliers' positions, [AA1, AAp8]:

In [30]:
my @pos = outlier-identifier(@dsScores.map(*<distance>), identifier => (&bottom-outliers o &quartile-identifier-parameters));

@pos.max

74

**Remark:** We show only the max outlier position since the distances are sorted in ascending order.

**Remark:** The outlier identifiers `&hampel-identifier-parameters` and `&splus-quartile-identifier-parameters` give 84 and 22 outliers, respectively.
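
To illustrate the idea behind a quartile-based search for bottom outliers, here is a minimal sketch in plain Raku. It rests on two assumptions that are not confirmed by this notebook: that the lower threshold has the form "median minus the interquartile range", and that a simple nearest-rank quantile is good enough. The parameter definitions and quantile estimator of `Statistics::OutlierIdentifiers` may differ, see [AA1, AAp8]. The data and the helper sub `quantile` are hypothetical.

```raku
# Hypothetical sorted distances: a few small values followed by a plateau
my @distances = 0.90, 0.93, 0.97, 1.25, 1.28, 1.30, 1.31, 1.32,
                1.33, 1.34, 1.35, 1.36, 1.37, 1.38, 1.39, 1.40;

# Simple nearest-rank quantile (an assumption; other estimators exist)
sub quantile(@data, $p) {
    my @sorted = @data.sort;
    @sorted[ max(0, ceiling($p * @sorted.elems) - 1) ]
}

my ($q1, $median, $q3) = (0.25, 0.5, 0.75).map({ quantile(@distances, $_) });

# Assumed lower threshold: median minus the interquartile range
my $lower = $median - ($q3 - $q1);

# Positions (0-based) of the values below the lower threshold
my @pos = @distances.grep(* < $lower, :k);

say "lower threshold: $lower";
say "outlier positions: @pos[]";
```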

The text chunks that correspond to the found outliers are considered to be the most relevant to the query and can be used to compose LLM prompts for the query.

**Remark:** For a given vector database object `$vdbObj`, the text chunks corresponding to the vectors are accessed with `$vdbObj.items`. (The vectors are accessed with `$vdbObj.vectors`.)
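
One way to turn the outlier positions into a prompt is to keep the chunks up to the largest outlier position and join them with the question. The sketch below does this with plain string operations only; it does not call an LLM. The stand-in data, `$max-pos`, and the chunk texts are hypothetical.

```raku
# Hypothetical stand-ins for the sorted dataset and the largest outlier position
my @chunks =
    %( distance => 0.91, text => 'First relevant statement.' ),
    %( distance => 0.93, text => 'Second relevant statement.' ),
    %( distance => 1.30, text => 'An unrelated statement.' );
my $max-pos = 1;

my $question = 'What is the state of string theory?';

# Keep the chunks at positions 0..$max-pos and join everything into one prompt
my $prompt = (
    'Come up with a narration answering this question:',
    $question,
    'using these discussion statements:',
    @chunks.head($max-pos + 1).map(*<text>).join("\n")
).join("\n");

say $prompt;
```

Selecting chunks by outlier position ties the amount of context to the shape of the distance curve rather than to a fixed count.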

-----

## Answer based on nearest neighbors

Generate an answer:

In [None]:
# % markdown
my $answer = llm-synthesize([
    'Come up with a narration answering this question:',
    $query,
    "using these discussion statements:",
    @dsScores.head(40).map(*<text>).join("\n")
    ],
    e => $conf4o
);

$answer

The state of string theory is a complex and contentious topic within the realm of theoretical physics. While there are certainly individuals at the forefront of the field, the consensus among some prominent physicists is far from optimistic. One leading figure has unequivocally stated, "I can tell you with absolute certainty, string theory is not the theory of the real world." This sentiment is echoed by others who share the view that string theory, despite its mathematical elegance, does not accurately describe our universe.

This skepticism is not without basis. At a recent string theory convention, when asked, "What is string theory?" the best minds in the field could only muster responses like, "We kind of don't know what string theory is," or more vaguely, "Whatever it is that we're doing." This highlights a fundamental issue: even the experts are grappling with the very definition and scope of string theory.

Critics argue that string theory has become more of a sociological phenomenon than a scientific one. It has been described as "the most failed theory in the history of physics" when measured by the number of papers published, the amount of funding received, and the number of PhDs awarded, relative to its tangible achievements in physics. The theory's survival, some claim, hinges on its ability to marginalize and discredit alternative approaches, a tactic that has stymied progress in theoretical physics.

The debate extends beyond the scientific community. Some believe that the focus on string theory has diverted attention and resources from other potentially fruitful areas of research. This "obsession" with string theory is seen by some as a shiny, tempting distraction that has curtailed exploration in other domains of physics.

Despite these criticisms, there are those who see value in the mathematical structures and mechanisms inspired by string theory. They argue that while the theory itself may not describe the real world, it has nonetheless contributed valuable insights and tools to the broader field of physics.

In summary, the state of string theory is one of profound division. While it remains a significant and influential area of study, its status as a viable theory of the universe is increasingly questioned. The field is at a crossroads, with some advocating for a reevaluation of its role in theoretical physics and others continuing to explore its mathematical and conceptual potential. The future of string theory, therefore, remains uncertain, caught between its past promises and present criticisms.

-------

## References

### Articles

[AA1] Anton Antonov, 
["Outlier detection in a list of numbers"](https://rakuforprediction.wordpress.com/2022/05/29/outlier-detection-in-a-list-of-numbers/),
(2022),
[RakuForPrediction at WordPress](https://rakuforprediction.wordpress.com).

### Packages

[AAp1] Anton Antonov,
[WWW::OpenAI Raku package](https://github.com/antononcube/Raku-WWW-OpenAI),
(2023),
[GitHub/antononcube](https://github.com/antononcube).

[AAp2] Anton Antonov,
[WWW::PaLM Raku package](https://github.com/antononcube/Raku-WWW-PaLM),
(2023),
[GitHub/antononcube](https://github.com/antononcube).

[AAp3] Anton Antonov,
[LLM::Functions Raku package](https://github.com/antononcube/Raku-LLM-Functions),
(2023-2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp4] Anton Antonov,
[LLM::Prompts Raku package](https://github.com/antononcube/Raku-LLM-Prompts),
(2023-2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp5] Anton Antonov,
[ML::FindTextualAnswer Raku package](https://github.com/antononcube/Raku-ML-FindTextualAnswer),
(2023-2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp6] Anton Antonov,
[Math::Nearest Raku package](https://github.com/antononcube/Raku-Math-Nearest),
(2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp7] Anton Antonov,
[Math::DistanceFunctions Raku package](https://github.com/antononcube/Raku-Math-DistanceFunctions),
(2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp8] Anton Antonov,
[Statistics::OutlierIdentifiers Raku package](https://github.com/antononcube/Raku-Statistics-OutlierIdentifiers),
(2022),
[GitHub/antononcube](https://github.com/antononcube).

### Videos

[CWv1] Chris Williamson,
["Eric Weinstein - Are We On The Brink Of A Revolution? (4K)"](https://www.youtube.com/watch?v=PYRYXhU4kxM),
(2024),
[YouTube/@ChrisWillx](https://www.youtube.com/@ChrisWillx).   
([transcript](https://podscripts.co/podcasts/modern-wisdom/833-eric-weinstein-are-we-on-the-brink-of-a-revolution).)