# Question Answering System by Retrieval Augmented Generation

### *Guide*

Anton Antonov    
September 2024  

-----

## Introduction

This notebook shows how to import a precomputed vector database of LLM embeddings and then use an LLM to generate answers to queries from the retrieved text chunks.

-------

## Setup

In [3]:
use Data::Importers;
use LLM::Configurations;
use LLM::Functions;
use XDG::BaseDirectory :terms;

use LLM::RetrievalAugmentedGeneration;
use LLM::RetrievalAugmentedGeneration::VectorDatabase;

use Data::Reshapers;
use Data::Summarizers;
use Math::Nearest;
use Math::DistanceFunctions;
use Statistics::OutlierIdentifiers;

-----

## Import Vector Database

In this section we import the vector database and compute some basic statistics over it.

Here we make an empty vector database object:

In [2]:
my $vdbObj = LLM::RetrievalAugmentedGeneration::VectorDatabase.new();

VectorDatabase(:id("d6093197-f28c-4345-82c7-43afbedfb492"), :name(""), :elements(0), :sources(0))

Here we form a file path for a previously computed (and exported) vector database using [`$XDG_DATA_HOME`](https://specifications.freedesktop.org/basedir-spec/latest/index.html):

In [4]:
# The sub-directory
my $dirname = data-home.Str ~ '/raku/LLM/SemanticSearchIndex';

# The exported vector database base file name
my $basename = 'SemSe-603118ff-5738-4328-92cc-7aa2a261714d.json';

# Corresponding IO::Path object
my $file = IO::Path.new(:$dirname, :$basename);

# Check for existence
$file.f

True
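
For readers without `XDG::BaseDirectory`, the path construction can be approximated in plain Raku. This is only a sketch: it assumes the XDG Base Directory convention that `$XDG_DATA_HOME` defaults to `$HOME/.local/share` when unset or empty, and the file name `example.json` is a placeholder, not the actual exported database.

```raku
# Resolve the data home: use $XDG_DATA_HOME if set and non-empty,
# otherwise fall back to the default given by the specification.
my $data-home = (%*ENV<XDG_DATA_HOME> || "{$*HOME}/.local/share").IO;

# Append the sub-directory and a placeholder file name.
my $path = $data-home.add('raku/LLM/SemanticSearchIndex').add('example.json');

say $path;
say $path.f ?? 'file exists' !! 'file not found';
```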

Import the vector database:

In [4]:
my $tstart = now;

$vdbObj.import($file);

my $tend = now;

say "Import time { $tend - $tstart } seconds.";

Import time 4.201005006 seconds.


In [5]:
say $vdbObj;

VectorDatabase(:id(""), :name("No833"), :elements(408), :sources(408))


Show text chunks sample:

In [6]:
#% html
$vdbObj.text-chunks.pick(3) ==> to-html()

0,1
336.0,"So it's a book from the 18 hundreds, and it is about a sphere that goes to visit a two dimensional world, a flatland. So in the two dimensional world, you have different shapes and the shapes denote the class."

0,1
193.0,Yeah. Can you explain? Because it's one of the coolest things that I've learned.

0,1
406.0,"I tried to do it. I tried to get that. The line of cocaine you said you didn't want when we went to the bathroom. Look, I'm trying to embody my old club promoter world, but, dude, I really do appreciate you. I look forward to making sense of what happens over the next couple of months at some point next year."


Show vector database dimensions:

In [7]:
$vdbObj.database.&dimensions

(408 1536)

Show vectors' norms:

In [8]:
$vdbObj.database.pick(3).deepmap({ norm($_.value) })

(1.0000000286751893 1.0000000280378905 1.000000056732192)

Here is a summary of the norms of all vectors:

In [9]:
sink records-summary($vdbObj.database.values.map({ norm($_) }))

+------------------------------+
| numerical                    |
+------------------------------+
| 3rd-Qu => 1.0000000392624304 |
| Mean   => 1.000000015594975  |
| Median => 1.0000000139958338 |
| Max    => 1.0000001113338148 |
| 1st-Qu => 0.9999999920348879 |
| Min    => 0.9999999106319158 |
+------------------------------+
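
Since all norms are (numerically) 1, Euclidean distance and cosine similarity rank neighbors identically: for unit vectors, the squared distance equals 2 minus twice the dot product. The following plain-Raku sketch checks that identity on two made-up vectors; the helper names (`dot`, `l2-norm`, `normalize`, `euclid`) are illustrative and not taken from the packages loaded above.

```raku
# Illustrative helpers in plain Raku
sub dot(@a, @b)   { [+] @a Z* @b }
sub l2-norm(@a)   { sqrt dot(@a, @a) }
sub normalize(@a) { my $n = l2-norm(@a); @a.map(* / $n).List }
sub euclid(@a, @b) { l2-norm((@a Z- @b).List) }

# Two made-up vectors, scaled to unit length
my @u = normalize([3, 4, 0]);
my @v = normalize([1, 2, 2]);

my $d   = euclid(@u, @v);   # Euclidean distance
my $cos = dot(@u, @v);      # cosine similarity (vectors are unit length)

# The two quantities below agree up to floating-point error
say $d ** 2;
say 2 - 2 * $cos;
```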


------

## Themes found in the text chunks

The vector database object has the attributes:
- `database` which is a `Map` of labels (identifiers) to LLM embedding vectors
- `text-chunks` which is a `Map` of labels to text chunks that correspond to the embedding vectors

(The keys of `database` and `text-chunks` are the same.)
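
A toy illustration of that layout, with made-up labels, vectors, and texts (the real object holds 408 entries of 1536-dimensional vectors):

```raku
# Toy stand-ins for the two attributes
my %database    = 'a' => (0.6, 0.8), 'b' => (1.0, 0.0);
my %text-chunks = 'a' => 'first chunk', 'b' => 'second chunk';

# Both maps share the same key set, so a label found via the vectors
# can be used directly to fetch the corresponding text.
say %database.keys.sort.List eqv %text-chunks.keys.sort.List;
say %text-chunks<a>;
```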


By examining the LLM-extracted themes from the text chunks of the imported vector database, 
we see that the discussion they came from is [fairly eclectic](https://www.youtube.com/watch?v=PYRYXhU4kxM), [CWv1]:

In [15]:
#% html
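# `$conf4o` is an LLM configuration object assumed to be defined elsewhere; its definition is not shown in this notebook.
my $res = llm-synthesize([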
    llm-prompt("ThemeTableJSON")(
        $vdbObj.text-chunks.sort(*.key)».value.join("\n"), 'text', 15, 
    )
    ],
    e => $conf4o,
    form => sub-parser('JSON'):drop
);

$res ==> to-html(field-names => <theme content>, align => 'left')

theme,content
Introduction and Guest Overview,"Introduction of Eric Weinstein, his background, and topics to be discussed."
2024 Presidential Election,"Discussion on the 2024 election, Joe Biden, Donald Trump, and censorship."
Rules-Based International Order,Explanation of the international order and its impact on U.S. politics.
Trump's Presidency and Populism,"Analysis of Trump's presidency, populism, and its effects on international alliances."
Democracy Paradox,Discussion on the paradox of democracy and institutional strength.
Media Manipulation,"Insights on media manipulation, fake news, and the role of major news outlets."
Conspiracy Theories,"Examination of conspiracy theories, responsible theorizing, and historical examples."
Retconning and Managed Reality,"Discussion on retroactive continuity, managed reality, and the impact on public perception."
Scientific Stagnation and String Theory,"Critique of string theory, scientific stagnation, and the need for new approaches."
Criticism Capture,"Exploration of criticism capture, its effects on public figures, and the accuracy budget."


**Remark:** It is instructive to compare the extracted themes with the list of video segments given in [CWv1]. 

-----

## Nearest neighbors finding

Here is a query:

In [10]:
my $query = 'What is the state of string theory?';

What is the state of string theory?

Here we find the labels of the vectors (and text chunks) that are _considered_ nearest neighbors of the query:

In [21]:
my @nnLabels = $vdbObj.nearest($query, 10, prop => <label distance>, distance-function => &euclidean-distance)

[(119.0 0.8134249468076029) (112.0 0.9080564755048836) (133.0 0.9377928180417803) (120.0 0.9573860154673669) (100.0 0.9909664988765275) (135.0 0.9922121082582246) (115.0 1.0233996483306704) (129.0 1.027231944845438) (131.0 1.033582027704631) (102.0 1.0574237432559146)]

Here we make the corresponding dataset that includes the text chunks from the vector database:

In [24]:
my @dsScores = @nnLabels.map({
    %( label => $_[0], distance => $_[1], text => $vdbObj.text-chunks{$_[0]} )
});

@dsScores.&dimensions

(10 3)

Here we show the dataset:

In [25]:
#% html
@dsScores ==> to-html(field-names => <distance label text>, align => 'left');

distance,label,text
0.8134249468076029,119.0,But is somebody at the forefront of string theory?
0.9080564755048836,112.0,"It's not shiny. That is saying everything else is crap and dangerous. In other words, string theory can't sell itself as physics by any telling of the story. String theory is the most failed theory in the history of physics. If you look at the number of papers, the amount of money, the number of people, the number of PhDs, number of conferences, achievements in physics proper per investment or size of effort, it is the most failed theory in the history of physics. And the way in which it survives is by hunting and destroying its enemies and making its enemies dependent on them. We all have a circuit in our brain that we're going to run to the string theorists to talk about the problem with string theory because of peer review. It's like when I want to report the police department for being corrupt, you should go to the police with that. Wait, you're not understanding. So that's the problem."
0.9377928180417804,133.0,"And I saw a tweet saying that somebody had been to a string theory convention and had asked the question, what is string theory? And the best string theorists on the planet came up with the answer, we kind of don't know what string theory is."
0.9573860154673668,120.0,"And he said, quote, I can tell you with absolute certainty string theory is not the theory of the real world. I can tell you that 100%. My strong feelings are exactly that. String theory is definitely not the theory of the real world. Is that taking it out of context? Is that him framing it somewhere else? Or does that encapsulate the fact that he thinks string theory is a dead end that doesn't describe the world?"
0.9909664988765275,100.0,This is the big question. We don't know whether that we're talking about the stagnation of theoretical physics or just nuclear physics.
0.9922121082582246,135.0,"Yeah. Is this too far gone for string theory now, is it? The mask is beginning to slip to the point where even Ed Dutton is going to have to eat his words within the next decade."
1.0233996483306704,115.0,"I think I'm just trying to say it's the problem with string theory, not the equations, not the shininess, not the advertising campaign. The problem is, look at how they treat everyone else. Everyone who is not a string theorist, who is trying to do stuff that could end up as a deemed export or as restricted data is covered and splattered in shit."
1.027231944845438,129.0,Who else would you want to have a chat with the guys on the string theory side of the world?
1.033582027704631,131.0,"So I don't know much or anything really about the inner workings of string theory, but Sabine Hossenfeld has been on the show, Brian Greene's been on the show, Sean Carroll's been on the show."
1.0574237432559146,102.0,"I'll do the decision tree. One possibility is that they're simply saying that they made nuclear physics very, very difficult to do, and that has to do with not very sexy physics, the physics of protons and neutrons and nuclei. So that branch exists. The other branch says, we used string theory to cock block actual progress in theoretical physics and derailed an entire field, at least in public."


-----

## Nearest neighbors finding (low-level)

In this section we show how to find the elements of the vector database that are _considered_ nearest neighbors to the query vector. 
We use "low-level" computations for didactic purposes.
The same or similar results can be obtained by using the method `nearest` of the vector database object.

Here is the vector embedding of the query (computed with the same LLM that was used to compute the vector database):

In [27]:
my @query-vector = |llm-embedding($query, llm-evaluator => $vdbObj.llm-configuration).head;

@query-vector.elems

1536

For each vector in the database find its distance to the query vector:

In [28]:
my @dsScores =
        $vdbObj.database.map({ %(
            label => $_.key,
            distance => euclidean-distance($_.value, @query-vector),
            text => $vdbObj.text-chunks{$_.key}
        ) }).Array;

@dsScores.&dimensions

(408 3)

Sort in ascending order:

In [29]:
@dsScores .= sort({ $_<distance> });

@dsScores.map(*<distance>).head(6)

(0.8135070843089548 0.9081004694861233 0.9377962967576634 0.9573803222849635 0.9910115736190397 0.992206146436619)
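
The distance-then-sort steps above amount to a brute-force nearest neighbors search. Here is a self-contained sketch of the same idea on toy 2-D data; the labels, vectors, and the `euclid` helper are made up for illustration and stand in for the database entries and `euclidean-distance`.

```raku
# Toy 2-D "embeddings" keyed by label (made-up values)
my %vectors = c1 => (1.0, 0.0), c2 => (0.0, 1.0), c3 => (0.8, 0.6);
my @q = 0.9, 0.1;   # toy query vector

# Euclidean distance in plain Raku
sub euclid(@a, @b) { sqrt [+] (@a Z- @b).map(* ** 2) }

# One record per database entry ...
my @scored = %vectors.map(-> $p {
    %( label => $p.key, distance => euclid($p.value, @q) )
});

# ... sorted in ascending order of distance to the query
@scored .= sort(*<distance>);

# Labels of the two nearest entries
say @scored.head(2).map(*<label>);
```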

Show the text chunks closest to the query:

In [30]:
#% html
@dsScores.head(12) ==> to-html(field-names => <distance label text>, align => 'left');


distance,label,text
0.8135070843089548,119.0,But is somebody at the forefront of string theory?
0.9081004694861232,112.0,"It's not shiny. That is saying everything else is crap and dangerous. In other words, string theory can't sell itself as physics by any telling of the story. String theory is the most failed theory in the history of physics. If you look at the number of papers, the amount of money, the number of people, the number of PhDs, number of conferences, achievements in physics proper per investment or size of effort, it is the most failed theory in the history of physics. And the way in which it survives is by hunting and destroying its enemies and making its enemies dependent on them. We all have a circuit in our brain that we're going to run to the string theorists to talk about the problem with string theory because of peer review. It's like when I want to report the police department for being corrupt, you should go to the police with that. Wait, you're not understanding. So that's the problem."
0.9377962967576634,133.0,"And I saw a tweet saying that somebody had been to a string theory convention and had asked the question, what is string theory? And the best string theorists on the planet came up with the answer, we kind of don't know what string theory is."
0.9573803222849636,120.0,"And he said, quote, I can tell you with absolute certainty string theory is not the theory of the real world. I can tell you that 100%. My strong feelings are exactly that. String theory is definitely not the theory of the real world. Is that taking it out of context? Is that him framing it somewhere else? Or does that encapsulate the fact that he thinks string theory is a dead end that doesn't describe the world?"
0.9910115736190396,100.0,This is the big question. We don't know whether that we're talking about the stagnation of theoretical physics or just nuclear physics.
0.992206146436619,135.0,"Yeah. Is this too far gone for string theory now, is it? The mask is beginning to slip to the point where even Ed Dutton is going to have to eat his words within the next decade."
1.0233900207921507,115.0,"I think I'm just trying to say it's the problem with string theory, not the equations, not the shininess, not the advertising campaign. The problem is, look at how they treat everyone else. Everyone who is not a string theorist, who is trying to do stuff that could end up as a deemed export or as restricted data is covered and splattered in shit."
1.0272712184226906,129.0,Who else would you want to have a chat with the guys on the string theory side of the world?
1.0336534449821548,131.0,"So I don't know much or anything really about the inner workings of string theory, but Sabine Hossenfeld has been on the show, Brian Greene's been on the show, Sean Carroll's been on the show."
1.057460017810909,102.0,"I'll do the decision tree. One possibility is that they're simply saying that they made nuclear physics very, very difficult to do, and that has to do with not very sexy physics, the physics of protons and neutrons and nuclei. So that branch exists. The other branch says, we used string theory to cock block actual progress in theoretical physics and derailed an entire field, at least in public."


Show the nearest neighbors scores (with the chosen distance function):

In [31]:
text-list-plot(@dsScores.map(*<distance>), width => 100, height => 16)

+----+---------------------+---------------------+--------------------+---------------------+------+      
+                                                                                                  +  1.50
+                                                                                         ****     +  1.40
|                                                           *******************************        |      
+                            *******************************                                       +  1.30
|               **************                                                                     |      
+          ******                                                                                  +  1.20
|        ***                                                                                       |      
+       **                                                                                         +  1.10
|      **                            

From the plot we can see that there are clear outliers. Here we find the outliers' positions, [AA1, AAp8]:

In [32]:
my @pos = outlier-identifier(@dsScores.map(*<distance>), identifier => (&bottom-outliers o &quartile-identifier-parameters));

@pos.max

64

**Remark:** We show only the max outlier position since the distances are sorted in ascending order.

**Remark:** The outlier identifiers `&hampel-identifier-parameters` and `&splus-quartile-identifier-parameters` give 84 and 22 outliers, respectively.

The text chunks that correspond to the found outliers are considered most relevant to the query and can be used to form LLM prompts for answering the query.
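
To give a rough idea of how a quartile-based rule flags unusually small distances, here is a minimal sketch. It assumes the common Tukey-style lower fence `Q1 - 1.5 * IQR` and a linear-interpolation quantile; the thresholds actually used by `&quartile-identifier-parameters` in `Statistics::OutlierIdentifiers` may differ, and the distances below are made up.

```raku
# Linear-interpolation quantile of already sorted data
# (one of several common conventions; an assumption of this sketch).
sub quantile(@sorted, $p) {
    my $h  = (@sorted.elems - 1) * $p;
    my $lo = $h.floor;
    my $hi = $h.ceiling;
    @sorted[$lo] + ($h - $lo) * (@sorted[$hi] - @sorted[$lo])
}

# Made-up distances, already in ascending order
my @distances = 0.81, 0.90, 1.20, 1.22, 1.25, 1.27, 1.30, 1.31, 1.33, 1.35, 1.38, 1.40;

my ($q1, $q3) = quantile(@distances, 0.25), quantile(@distances, 0.75);
my $lower-fence = $q1 - 1.5 * ($q3 - $q1);

# 0-based positions of the "bottom" outliers, i.e. values below the lower fence
my @pos = @distances.grep(* < $lower-fence, :k);
say @pos;
```

Because the distances are sorted, the bottom outliers form a prefix of the list, which is why only the maximum position needs to be reported.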

-----

## Answer based on nearest neighbors

Generate an answer:

In [47]:
#% markdown
my $answer = llm-synthesize([
    'Come up with a narration answering this question:',
    $query,
    "using these discussion statements:",
    @dsScores.head(40).map(*<text>).join("\n")
    ],
    e => $conf4o
);

$answer

The state of string theory, once a shining beacon of hope in the realm of theoretical physics, has become a subject of intense debate and controversy. At the forefront of this discussion are prominent figures like Sabine Hossenfelder, Brian Greene, and Sean Carroll, who have all weighed in on the matter, each bringing their unique perspectives and criticisms.

String theory, which once promised to be the "theory of everything," has struggled to deliver on its grandiose claims. Despite the vast number of papers, conferences, and PhDs dedicated to it, the theory has yet to produce concrete, verifiable predictions that align with the physical world. Critics argue that it has become the most failed theory in the history of physics, given the immense resources invested in it compared to its tangible achievements.

The theory's survival seems to hinge on a combination of academic politics and a reluctance to abandon a once-promising idea. String theorists have been accused of stifling alternative approaches and maintaining their dominance through peer review and academic gatekeeping. This has led to a situation where even asking fundamental questions about the validity of string theory can be seen as a career risk.

A telling anecdote from a recent string theory convention highlights the theory's current predicament. When asked, "What is string theory?" the best minds in the field could only muster a vague, uncertain response: "We kind of don't know what string theory is." This admission underscores the theory's ambiguous status and the growing frustration among physicists who feel that string theory has diverted attention from more promising avenues of research.

Prominent critics like Peter Woit and Lee Smolin have been vocal about the shortcomings of string theory. Woit, in particular, has gained recognition for his comprehensive critiques and alternative theories, which, while controversial, have sparked important discussions about the direction of theoretical physics.

On the other hand, defenders of string theory, such as Leonard Susskind, argue that the mathematical elegance and potential insights offered by the theory should not be dismissed outright. However, even Susskind's staunchest supporters acknowledge that the theory has not yet fulfilled its promise and that its proponents must answer for the decades of unfulfilled expectations.

The debate over string theory also reflects broader issues within the scientific community. The focus on high-profile, speculative theories has sometimes overshadowed more pragmatic, experimentally verifiable research. This has led to calls for a reevaluation of funding priorities and a renewed emphasis on empirical science.

In conclusion, the state of string theory is one of uncertainty and introspection. While it remains a topic of significant interest and debate, its future is unclear. Theoretical physics may need to pivot towards new ideas and approaches, leaving behind the allure of string theory in favor of more grounded, experimentally verifiable pursuits. The coming years will likely see a continued reevaluation of string theory's place in the scientific landscape, as physicists seek to balance the allure of elegant mathematics with the demands of empirical validation.
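
The prompt used above is essentially the instruction, the query, and the retrieved text chunks put together. The following sketch assembles such a prompt as a single string without calling an LLM; the records are placeholders, and joining with newlines is a choice made for this illustration rather than a statement about how `llm-synthesize` combines its arguments.

```raku
my $question = 'What is the state of string theory?';

# Placeholder records standing in for the sorted nearest-neighbor dataset
my @records =
    { distance => 0.81, text => 'first retrieved statement' },
    { distance => 0.94, text => 'second retrieved statement' },
    { distance => 1.30, text => 'a less relevant statement' };

# Keep only the top-k closest chunks as context
my $top-k   = 2;
my $context = @records.head($top-k).map(*<text>).join("\n");

my $prompt = join "\n",
    'Come up with a narration answering this question:',
    $question,
    'using these discussion statements:',
    $context;

say $prompt;
```

The number of chunks to include is a trade-off: more chunks give the LLM more context, but also more loosely related material and a longer prompt.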

-------

## References

### Articles

[AA1] Anton Antonov, 
["Outlier detection in a list of numbers"](https://rakuforprediction.wordpress.com/2022/05/29/outlier-detection-in-a-list-of-numbers/),
(2022),
[RakuForPrediction at WordPress](https://rakuforprediction.wordpress.com).

### Packages

[AAp1] Anton Antonov,
[WWW::OpenAI Raku package](https://github.com/antononcube/Raku-WWW-OpenAI),
(2023),
[GitHub/antononcube](https://github.com/antononcube).

[AAp2] Anton Antonov,
[WWW::PaLM Raku package](https://github.com/antononcube/Raku-WWW-PaLM),
(2023),
[GitHub/antononcube](https://github.com/antononcube).

[AAp3] Anton Antonov,
[LLM::Functions Raku package](https://github.com/antononcube/Raku-LLM-Functions),
(2023-2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp4] Anton Antonov,
[LLM::Prompts Raku package](https://github.com/antononcube/Raku-LLM-Prompts),
(2023-2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp5] Anton Antonov,
[ML::FindTextualAnswer Raku package](https://github.com/antononcube/Raku-ML-FindTextualAnswer),
(2023-2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp6] Anton Antonov,
[Math::Nearest Raku package](https://github.com/antononcube/Raku-Math-Nearest),
(2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp7] Anton Antonov,
[Math::DistanceFunctions Raku package](https://github.com/antononcube/Raku-Math-DistanceFunctions),
(2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp8] Anton Antonov,
[Statistics::OutlierIdentifiers Raku package](https://github.com/antononcube/Raku-Statistics-OutlierIdentifiers),
(2022),
[GitHub/antononcube](https://github.com/antononcube).

### Videos

[CWv1] Chris Williamson,
["Eric Weinstein - Are We On The Brink Of A Revolution? (4K)"](https://www.youtube.com/watch?v=PYRYXhU4kxM),
(2024),
[YouTube/@ChrisWillx](https://www.youtube.com/@ChrisWillx).   
([transcript](https://podscripts.co/podcasts/modern-wisdom/833-eric-weinstein-are-we-on-the-brink-of-a-revolution).)