# RAGBooster Demo 

## Data-Centric Refinement for Question Answering with Large Language Models

The scenario for this demo is **question answering with Large Language Models (LLMs)**. We use a dataset of questions about the **place of birth** of various people from the Wikifact dataset in Stanford's [HELM](https://crfm.stanford.edu/helm/latest/) benchmark. We use a sample of 500 questions from the data as the final test set for this demo.

An example question is about the birthplace of the Slovak ice hockey player Martin Kulha:

In [1]:
from ragbooster import BingRetriever, GPT35Generator, RAGModel, RAGBooster, score
from ragbooster.demo import load_wikifact_questions

questions = load_wikifact_questions('demo_data/wikifact_place_of_birth_helm.json')

validation_questions = questions[:500]
test_questions = questions[500:1000]

example_question = questions[5]
example_question

Question(text='Martin Kulha was born in', correct_answers=['Poprad'], metadata={})

### GPT3.5

Let's see how well OpenAI's `text-davinci-003` model from the **GPT3.5** family does on these questions.
We can leverage GPT3.5 by extending the `GPT35Generator` class. We write a few lines of Python to define how to create our prompt from the question and some few-shot examples, and how to parse the answer returned by GPT3.5.

_(Note that this requires an OpenAI API key to be available via the `OPENAI_API_KEY` environment variable)_

In [2]:
import re 

class MyGPT35Generator(GPT35Generator):
    
    FEW_SHOTS = "\nJerry Beck (born February 9, 1955, in New York City) is an American animation historian," +\
                " author, blogger, and video producer.Beck wrote or edited several books on classic" +\
                " American animation and classic characters.\nJerry Beck was born in New York\n\n" +\
                "Ettore Maria Fizzarotti (1916–1985) was an Italian film director and screenwriter." +\
                " Born in Naples, the son of the director Armando, he debuted as assistant director" +\
                " in the films of his father.\nEttore Maria Fizzarotti was born in Naples\n"    
    
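    def _create_prompt(self, question, retrieved_text):
        if retrieved_text is None:
            # No retrieved text: few-shot examples followed by the question.
            return f"{self.FEW_SHOTS}\n\n{question.text}"
        else:
            # Place the retrieved text between the few-shot examples and the question.
            return f"{self.FEW_SHOTS}\n\n{retrieved_text}\n\n{question.text}"


    def _extract_answer(self, response):
        answer = response.get_response()

        # Remove all digits (e.g., from dates) and surrounding whitespace.
        answer = re.sub(r'[0-9]+', '', answer)
        answer = answer.strip()

        # Keep only the text before the first comma ...
        if ',' in answer:
            answer = answer.split(',')[0]

        # ... and before the first period.
        if '.' in answer:
            answer = answer.split('.')[0]

        answer = answer.strip()
        return answer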
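
To make the parsing step concrete, here is a minimal standalone sketch that mirrors the logic of `_extract_answer` on plain strings. The function name `extract_answer` and the example responses are hypothetical and only serve as illustration; they are not actual GPT3.5 outputs.

```python
import re


def extract_answer(raw_response):
    """Standalone copy of the parsing steps from `_extract_answer` (illustrative only)."""
    # Drop all digits, then trim surrounding whitespace.
    answer = re.sub(r'[0-9]+', '', raw_response).strip()
    # Keep only the text before the first comma.
    if ',' in answer:
        answer = answer.split(',')[0]
    # Keep only the text before the first period.
    if '.' in answer:
        answer = answer.split('.')[0]
    return answer.strip()


# Hypothetical raw responses, chosen to show typical behavior.
print(extract_answer(" Poprad, Slovakia."))          # Poprad
print(extract_answer(" Poprad on August 7, 1976."))  # Poprad on August
```

Note that the rule-based cleanup is deliberately simple: a response that mentions a date before the first comma keeps the remainder of that date (such as a month name) in the parsed answer.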

In [3]:
gpt35 = MyGPT35Generator()

Unfortunately, GPT3.5 gives us the wrong answer to the example question!

In [4]:
print(f'GPT3.5 answers "{example_question.text}" with "{gpt35.generate(example_question)}"')
print(f'Correct answers are: {example_question.correct_answers}')

GPT3.5 answers "Martin Kulha was born in" with "Košice"
Correct answers are: ['Poprad']


### Poor performance of GPT3.5

GPT3.5 performs poorly on the test set with our current setup. It answers fewer than 5% of the questions correctly.

In [5]:
accuracy = score(test_questions, gpt35)

f'The accuracy of GPT3.5 on the test set is {accuracy}'

'The accuracy of GPT3.5 on the test set is 0.048'
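
For intuition on the reported number, the following is a minimal sketch of an accuracy computation. It assumes that a prediction counts as correct when it exactly matches one of the question's `correct_answers`; the actual `score` function may use a different matching rule. The helper `exact_match_accuracy` and the sample data are hypothetical.

```python
def exact_match_accuracy(predictions, correct_answers):
    """Fraction of predictions that exactly match one of the accepted answers.

    Assumption: exact string match; the library's `score` may differ.
    """
    if not predictions:
        return 0.0
    num_correct = sum(
        1 for predicted, accepted in zip(predictions, correct_answers)
        if predicted in accepted
    )
    return num_correct / len(predictions)


# Hypothetical predictions and accepted answers for four questions.
predictions = ["Košice", "Poprad", "Naples", "Paris"]
accepted = [["Poprad"], ["Poprad"], ["Naples"], ["Lyon"]]
print(exact_match_accuracy(predictions, accepted))  # 0.5
```

Under this reading, an accuracy of 0.048 on 500 questions corresponds to 24 correctly answered questions.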

## Retrieval Augmentation with Bing Websearch

We can improve the performance of GPT3.5 by providing it with external data to answer the questions, for example from the web. This is called **retrieval augmentation**. We use Microsoft's Bing Web Search API for this by extending the `BingRetriever` class, which requires us to define how to create a query from the question. In our case, we can simply use the question text as the query.

_(Note that this requires a Bing API key to be available via the `BING_SUBSCRIPTION_KEY` environment variable)_

In [6]:
class MyBingWebsearch(BingRetriever):
    def create_query(self, question):
        return question.text
    
bing_websearch = MyBingWebsearch()    

Here is the information that we find via Bing for our example question about Martin Kulha. Note that the top results already contain the correct answer 'Poprad' in the text!

In [7]:
retrieved = bing_websearch.retrieve(example_question)
for snippet, url in retrieved[:3]:
    print(url, '-', snippet, '\n')


https://www.wikilogy.com/biography/martin-kulha/ - Martin Kulha is an Ice Hockey Player. He was born in Poprad on August 07, 1976. Want to more about Him? In this article, we covered Martin Kulha's net worth, wiki, bio, career, height, weight, pics, family, affairs, car, salary, age, facts, and other details in 2023. Continue reading to discover who is Martin Kulha. 

https://www.celebsagewiki.com/martin-kulha - Martin Kulha was born on 7 August, 1976 in Poprad, Slovakia. Discover Martin Kulha's Biography, Age, Height, Physical Stats, Dating/Affairs, Family and career updates. Learn How rich is She in this year and how She spends money? Also learn how She earned most of networth at the age of 44 years old? 

https://icehockey.fandom.com/wiki/Martin_Kulha - Martin Kulha (born August 7, 1976) is a Slovak professional ice hockey player who formerly played with Sangliers Arvernes de Clermont in the FFHG Division 1. He is now a member of the Lyon Club in the French Division 3. Kulha had pre

If we provide GPT3.5 with the retrieved extra information from Bing, it generates the correct answer 'Poprad' for six of the top ten results:

In [8]:
for snippet, url in retrieved[:10]:
    print(f'GPT3.5 gives the answer "{gpt35.generate(example_question, snippet)}" based on {url}')    

GPT3.5 gives the answer "Poprad on August" based on https://www.wikilogy.com/biography/martin-kulha/
GPT3.5 gives the answer "Poprad" based on https://www.celebsagewiki.com/martin-kulha
GPT3.5 gives the answer "Poprad" based on https://icehockey.fandom.com/wiki/Martin_Kulha
GPT3.5 gives the answer "Poprad" based on https://biogossipy.com/martin-kulha/
GPT3.5 gives the answer "Poprad" based on https://popularbio.com/martin-kulha/
GPT3.5 gives the answer "Poprad" based on https://www.hockeydb.com/ihdb/stats/pdisplay.php?pid=57405
GPT3.5 gives the answer "Pohoří" based on https://www.myheritage.com/names/martin_kulha
GPT3.5 gives the answer "Slovakia" based on http://www.vipfaq.com/Martin%20Kulha.html
GPT3.5 gives the answer "Poprad" based on https://networthmask.com/martin-kulha/
GPT3.5 gives the answer "August th" based on https://en.wikipedia.org/wiki/Martin_Kulha


In order to leverage this finding, we implement a **retrieval-augmented model** (`RAGModel`), which generates the final 
answer via a **majority vote over the top-k generated answers from GPT3.5 based on the data from Bing**.

This model gives us the correct answer:

In [9]:
rag = RAGModel(bing_websearch, gpt35, k=10)

print(f'GPT3.5 with retrieval augmentation gives the correct answer "{rag.generate(example_question)}"')   

GPT3.5 with retrieval augmentation gives the correct answer "Poprad"
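
The majority vote over the top-k answers can be sketched in a few lines of plain Python. This is an illustration of the idea rather than the implementation inside `RAGModel`; the helper `majority_vote` and its tie-breaking behavior (the answer seen first wins) are assumptions. The input is the list of ten answers printed above.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer, or None if there are no answers.

    Ties are broken in favor of the answer that appears first (an assumption).
    """
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# The ten per-snippet answers for the Martin Kulha question shown above.
answers = [
    "Poprad on August", "Poprad", "Poprad", "Poprad", "Poprad",
    "Poprad", "Pohoří", "Slovakia", "Poprad", "August th",
]
print(majority_vote(answers))  # Poprad
```

The vote is robust to a few noisy sources: four of the ten answers are wrong here, but they disagree with each other and are outvoted.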


## Evaluation

Retrieval augmentation has a huge impact on the performance of our LLM. If we provide GPT3.5
with the top result from Bing, it answers more than a third of the questions correctly!

In [10]:
rag1 = RAGModel(bing_websearch, gpt35, k=1)

accuracy_rag_1 = score(test_questions, rag1)

f'The accuracy of GPT3.5 with retrieval augmentation and k=1 on the test set is {accuracy_rag_1}'

'The accuracy of GPT3.5 with retrieval augmentation and k=1 on the test set is 0.336'

Increasing k requires more predictions per question, but further improves accuracy.
With k=10, GPT3.5 answers nearly half of the questions correctly.

In [11]:
rag10 = RAGModel(bing_websearch, gpt35, k=10)

accuracy_rag_10 = score(test_questions, rag10)

f'The accuracy of GPT3.5 with retrieval augmentation and k=10 on the test set is {accuracy_rag_10}'

'The accuracy of GPT3.5 with retrieval augmentation and k=10 on the test set is 0.498'

## Improving Our Retrieval-Augmented LLM with Data-Centric Refinement

We can make our retrieval-augmented GPT3.5 do even better via **data-centric refinement** of its retrieval corpus. To do so, we learn importance weights for the data sources in the retrieval corpus (web domains in our concrete case), and prune the corpus to ignore less important data sources, which are likely to give us wrong answers.

Check out our paper on **_Improving Retrieval-Augmented Large Language Models with Data-Centric Refinement_** [TODO arxiv link!!!] for the theory, implementation details and experimental results.

Our approach, called `RAGBooster`, only needs the retrieval-augmented model and a set of validation questions to learn such weights:

In [12]:
refined_rag_model = RAGBooster(rag10, validation_questions)

Computing validation corpus...


Learning importance weights for data sources...
Tuning threshold for corpus pruning...
Achieved accuracy of 0.586 with a pruning threshold of 0.54092 on the validation set.
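
To illustrate what pruning the corpus means in practice, here is a minimal sketch that filters retrieved `(snippet, url)` pairs by the importance weight of their web domain. It is not the `RAGBooster` implementation: the helpers `domain_of` and `prune`, the rule of keeping sources whose weight is at least the threshold, the default weight for unknown domains, and the example weights are all assumptions made for illustration. Only the threshold value is taken from the output above.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Extract the host name from a URL, without a leading 'www.'."""
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc


def prune(retrieved, weights, threshold, default_weight=0.0):
    """Keep only results whose domain weight reaches the threshold.

    Assumption: domains without a learned weight get `default_weight`.
    """
    return [
        (snippet, url) for snippet, url in retrieved
        if weights.get(domain_of(url), default_weight) >= threshold
    ]


# Hypothetical weights; real weights are learned from the validation questions.
weights = {"celebsagewiki.com": 0.70, "vipfaq.com": 0.30}
retrieved = [
    ("Martin Kulha was born ... in Poprad, Slovakia.", "https://www.celebsagewiki.com/martin-kulha"),
    ("(snippet text)", "http://www.vipfaq.com/Martin%20Kulha.html"),
]
kept = prune(retrieved, weights, threshold=0.54092)
print([domain_of(url) for _, url in kept])  # ['celebsagewiki.com']
```

After pruning, the majority vote runs over the remaining sources only, so low-weight domains can no longer contribute wrong answers.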


**Data-centric refinement improves the performance** of the retrieval-augmented GPT3.5 even further!

In [13]:
accuracy_refined = score(test_questions, refined_rag_model)
improvement = accuracy_refined - accuracy_rag_10

f'RAGBooster improved the accuracy of GPT3.5 with retrieval augmentation by {improvement:.3f}'+\
f' to {accuracy_refined}!'

'RAGBooster improved the accuracy of GPT3.5 with retrieval augmentation by 0.034 to 0.532!'

### Investigating important data sources

We can ask our refined model about the importance weights it learned for different data sources. A complete list is available via `refined_rag_model.weights`. 

The domain `artvee.com` is among the top-ranked domains. It provides data about popular classical artists at https://artvee.com/artists/ :

In [14]:
refined_rag_model.importance('artvee.com')

0.7000000000000002

The domain `manpower.com.ng` contains a list of famous Nigerians at https://www.manpower.com.ng/people :

In [15]:
refined_rag_model.importance('manpower.com.ng')

0.7000000000000002

The domain `bollysuperstar.com` contains data about Bollywood actors at https://bollysuperstar.com/category/bollywood-actors/ :

In [16]:
refined_rag_model.importance('bollysuperstar.com')

0.6610798448581725

### TODO: Pointers to package/further examples 