# RAGBooster

We detail how to improve data imputation performance with retrieval augmentation and learned data pruning. We showcase how these techniques allow a tiny model to come within about 5 percentage points of the accuracy of a large commercial LLM with 175 billion parameters.

Our library "RAGBooster" for learned data pruning is available as open source.

### Setup

In order to run this demo, we need access to GPT3.5 and the Bing web API. Note that we implement caching and can serve the vast majority of requests from our cache for this particular demo setup.

 1. **Access to GPT3.5 from OpenAI**: This demo notebook leverages GPT3.5 via the OpenAI API. It requires you to make your [OpenAI API key](https://platform.openai.com/account/api-keys) available as an environment variable via the following command:<br/><br/>`export OPENAI_API_KEY=your_secret_openai_key`<br/><br/>
 
 1. **Access to the Bing Websearch API**: Furthermore, we will query the web via [Microsoft Bing's websearch API](https://www.microsoft.com/en-us/bing/apis/bing-web-search-api). You need to make your Bing API key available as an environment variable via the following command:<br/><br/>`export BING_SUBSCRIPTION_KEY=your_secret_bing_key`<br/><br/>
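
Before running the notebook, it can help to confirm that both keys are actually visible to the Python process. The following is a minimal sketch and not part of RAGBooster; the helper name `missing_keys` is hypothetical, and only the two variable names come from the setup steps above.

```python
import os

# Names of the environment variables from the setup steps above.
REQUIRED_KEYS = ["OPENAI_API_KEY", "BING_SUBSCRIPTION_KEY"]


def missing_keys(env=None):
    """Return the required variable names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


# Example with an explicit mapping instead of the real environment:
# only the OpenAI key is present, so the Bing key is reported as missing.
print(missing_keys({"OPENAI_API_KEY": "your_secret_openai_key"}))
```

Calling `missing_keys()` without arguments checks the real environment of the current process.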

In [1]:
import numpy as np
from sklearn.model_selection import train_test_split

from ragbooster import BingRetriever, HuggingfaceQAGenerator, Generator, RetrievalAugmentedModel, RAGBooster, score
from ragbooster.demo import load_imputation_dataset

We leverage a tabular dataset about restaurants, where the task is to impute the `city` attribute 
based on the `name`, `address` and `phone` number.

In [2]:
np.random.seed(42)

questions = load_imputation_dataset('demo_data/restaurant.csv', 
                                    impute='city', 
                                    based_on=['name', 'address', 'phone'])

The first validation question concerns a restaurant in Los Angeles:

In [3]:
validation_questions, test_questions = train_test_split(questions, test_size=0.5)
validation_questions[0]

Question(text='name: border grill; address: 4th st.; phone: 310/451-1655', correct_answers=['los angeles'], metadata={})
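
The question text above looks like a simple serialization of the `based_on` columns. The sketch below reproduces that format for a single row; it is inferred from the printed output only and is an assumption about what `load_imputation_dataset` does internally, with `serialize_row` being a hypothetical helper.

```python
def serialize_row(row, based_on):
    """Join the selected attributes as 'column: value' pairs separated by '; '."""
    return "; ".join(f"{column}: {row[column]}" for column in based_on)


# One record with the attribute to impute ('city') left out of the question text.
row = {
    "name": "border grill",
    "address": "4th st.",
    "phone": "310/451-1655",
    "city": "los angeles",
}
question_text = serialize_row(row, ["name", "address", "phone"])
print(question_text)
```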

In this demo, we showcase how far we can boost the performance of a small model designed for question answering. In particular, we use the [deepset/minilm-uncased-squad2](https://huggingface.co/deepset/minilm-uncased-squad2) model, and extend the `HuggingfaceQAGenerator` class to generate predictions with it.

In [4]:
class QAGenerator(HuggingfaceQAGenerator):
    
    def __init__(self, model_name, cache_path):
        super().__init__(model_name, cache_path)
    
    def _create_prompt(self, question, params):
        return { 'question': "What is the name of the city in which this restaurant is located?",
                 'context': question.text }
    
    def _extract_answer(self, response):
        return response['answer'].lower()
    
minilm = QAGenerator('deepset/minilm-uncased-squad2', 'demo_data/qa-cache.pkl')    

Out of the box, this model performs poorly on our imputation task and predicts 
less than 6% of the cities correctly.

In [5]:
accuracy = score(test_questions, minilm)

f'The accuracy of minilm is {accuracy}.'

  0%|          | 0/432 [00:00<?, ?it/s]

'The accuracy of minilm is 0.05555555555555555.'
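
The reported value equals 24/432, which is consistent with 24 correctly imputed cities out of 432 test questions. As a rough illustration of such a metric, here is an exact-match accuracy sketch; it assumes that a prediction counts as correct when it appears in the question's list of accepted answers, which may differ from what the library's `score` function actually does. The function name `exact_match_accuracy` is hypothetical.

```python
def exact_match_accuracy(predictions, accepted_answers):
    """Fraction of predictions found in the corresponding list of accepted answers."""
    if not predictions:
        return 0.0
    hits = sum(
        1 for prediction, accepted in zip(predictions, accepted_answers)
        if prediction in accepted
    )
    return hits / len(predictions)


# Toy example: the second prediction returns part of the address instead of a city.
predictions = ["los angeles", "4th st.", "san francisco"]
accepted_answers = [["los angeles"], ["los angeles"], ["san francisco"]]
accuracy = exact_match_accuracy(predictions, accepted_answers)
print(f"{accuracy:.3f}")
```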

### GPT3.5

Let's see how well GPT3.5 does on this task. We implement a custom `Generator` for GPT3.5, which uses a few-shot prompt generated from the first five samples of the validation data (in the format proposed in the recent VLDB paper [Can Foundation Models Wrangle Your Data?](https://www.vldb.org/pvldb/vol16/p738-narayan.pdf)).

In [7]:
import re 
from manifest import Manifest 

class FewShotGenerator(Generator):
    
    FEW_SHOT_PROMPT = "name: border grill; address: 4th st.; phone: 310/451-1655; city? los angeles\n\n"+\
        "name: le soleil; address: 133 clement st.; phone: 415/668-4848; city? san francisco\n\n"+\
        "name: cypress club; address: 500 jackson st.; phone: 415/296-8555; city? san francisco\n\n"+\
        "name: west; address: 63rd street steakhouse 44 w. 63rd st.; phone: 212/246-6363; city? new york\n\n"+\
        "name: schatzi on main; address: 3110 main st.; phone: 310/399-4800; city? los angeles\n\n"

    def __init__(self, llm):
        super().__init__(llm=llm, max_tokens=10)    
    
    def _create_prompt(self, question, params):        
        return f"{self.FEW_SHOT_PROMPT}{question.text}; city?"   
    
    def _extract_answer(self, response):
        answer = response.get_response()          

        answer = re.sub(r'[0-9]+', '', answer)
        answer = answer.strip()   

        for sep in ['\n', ',', '.']:
            if sep in answer:
                answer = answer.split(sep)[0]

        return answer.strip()  
    


gpt35_client = Manifest(client_name="openai", engine="text-davinci-003",
                        cache_name="sqlite", cache_connection="demo_data/gpt35-cache.sqlite")

gpt35 = FewShotGenerator(llm=gpt35_client)    
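
To see the prompt format and the answer cleanup in isolation, the sketch below rebuilds both as plain functions that need neither the OpenAI API nor `manifest`. `build_few_shot_prompt` and `clean_completion` are hypothetical helpers; the cleanup steps mirror `_extract_answer` above, applied to a plain string instead of a response object.

```python
import re


def build_few_shot_prompt(examples):
    """Format (question_text, answer) pairs as 'question; city? answer' blocks."""
    return "".join(f"{text}; city? {answer}\n\n" for text, answer in examples)


def clean_completion(completion):
    """Drop digits, then keep only the text before the first newline, comma or period."""
    answer = re.sub(r"[0-9]+", "", completion).strip()
    for sep in ["\n", ",", "."]:
        if sep in answer:
            answer = answer.split(sep)[0]
    return answer.strip()


examples = [
    ("name: border grill; address: 4th st.; phone: 310/451-1655", "los angeles"),
    ("name: le soleil; address: 133 clement st.; phone: 415/668-4848", "san francisco"),
]
prompt = build_few_shot_prompt(examples)
print(prompt)

# A completion that continues past the city name is cut at the first separator.
print(clean_completion(" san francisco\n\nname: another restaurant"))
print(clean_completion(" new york, ny 10023."))
```

One consequence of splitting on `.` is that abbreviated city names are truncated: `clean_completion("st. louis")` yields `st`. This does not matter for the cities shown in this demo, but it is worth keeping in mind when reusing the cleanup on other data.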

GPT3.5 performs astonishingly well on this task and imputes nearly 90% of the cities correctly. (Note that GPT3.5 has almost certainly seen the data at training time, as it is a common evaluation dataset in academic research.)

In [9]:
accuracy = score(test_questions, gpt35)

f'The accuracy of GPT3.5 is {accuracy}.'

  0%|          | 0/432 [00:00<?, ?it/s]

'The accuracy of GPT3.5 is 0.8981481481481481.'

## Retrieval Augmentation with Bing Websearch

We can improve the performance of our tiny model by providing it with some external data, for example from the web. This is called retrieval augmentation, and we use Microsoft's Bing websearch API for that by extending the `BingRetriever` class and defining how to create a query from the question text. In our case, we can just use the question text as the query.

In [10]:
class MyBingWebsearch(BingRetriever):
    
    def __init__(self, cache_path):
        super().__init__(cache_path)
    
    def create_query(self, question):
        return question.text
    
bing_websearch = MyBingWebsearch('demo_data/bing-cache.pkl')  

Let's look at an example to impute:

In [11]:
example = validation_questions[11]
example

Question(text="name: scala's bistro; address: 432 powell st.; phone: 415/395-8555", correct_answers=['san francisco'], metadata={})

Data from the web helps greatly here: the correct city 'san francisco' for this example is already contained in the top three results retrieved from Bing.

In [13]:
retrieved = bing_websearch.retrieve(example)
for snippet, url in retrieved[:3]:
    print(url, '-', snippet, '\n')

https://tableagent.com/san-francisco/scalas-bistro/ - Reservations Scala's Bistro Reservations Date Time Party Size Business Info + − Leaflet | © OpenStreetMap Address: 432 Powell Street, San Francisco CA 94102 Cross Street: Post Street Location: San Francisco | Union Square Cuisine: French | Italian | Pasta | Cost: | Moderate Category: Fine Dining Star Rating: Reservations: Unknown 

https://www.yellowpages.com/san-francisco-ca/mip/scalas-bistro-4887204 - ﻿ $$$ Italian Restaurants, Bars, Continental Restaurants (2) (2076) 7.1 OPEN NOW Today: 8:00 am - 11:00 pm 21 YEARS IN BUSINESS Amenities: (415) 395-8555 Map & Directions 432 Powell StSan Francisco, CA 94102 Write a Review Is this your business? Customize this page. Claim This Business Hours Regular Hours Scala's Bistro 432 Powell St, San Francisco 

https://www.chamberofcommerce.com/united-states/california/san-francisco/italian-restaurant/2006879304-scala-s-bistro - Scala's Bistro at 432 Powell St, San Francisco, CA 94102. Get Scal
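
Whether the correct answer shows up in the first few results can also be checked programmatically. The sketch below is a hypothetical helper, not part of RAGBooster; it assumes the `(snippet, url)` tuple layout used in the loop above and works on shortened copies of the snippets printed there.

```python
def answer_in_top_k(retrieved, answer, k):
    """True if `answer` occurs (case-insensitively) in any of the first k snippets."""
    return any(answer.lower() in snippet.lower() for snippet, _url in retrieved[:k])


# Shortened copies of the first two results shown above, as (snippet, url) pairs.
retrieved_example = [
    ("Address: 432 Powell Street, San Francisco CA 94102",
     "https://tableagent.com/san-francisco/scalas-bistro/"),
    ("Scala's Bistro 432 Powell St, San Francisco",
     "https://www.yellowpages.com/san-francisco-ca/mip/scalas-bistro-4887204"),
]

print(answer_in_top_k(retrieved_example, "san francisco", k=3))
print(answer_in_top_k(retrieved_example, "los angeles", k=3))
```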

In order to leverage the web data, we implement a new `Generator`, which uses the retrieved text from the web for the predictions.

In [15]:
class QAGeneratorWithContext(HuggingfaceQAGenerator):
    
    def __init__(self, model_name, cache_path):
        super().__init__(model_name, cache_path)
    
    def _create_prompt(self, question, params):
        retrieved_context = params['retrieved_context']
        return { 'question': "What is the name of the city in which this restaurant is located?",
                 'context': f'{retrieved_context};{question.text}' }
    
    def _extract_answer(self, response):
        return response['answer'].lower()
    
minilm_ctx = QAGeneratorWithContext('deepset/minilm-uncased-squad2', 'demo_data/qa_ctx-cache.pkl')

If we enhance our minilm with retrieval augmentation using `k=3` answers, it correctly imputes our example:

In [17]:
rag3 = RetrievalAugmentedModel(bing_websearch, minilm_ctx, k=3)
rag3.generate(example)

'san francisco'
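
The role of `k` can be illustrated with a small sketch of how a context string might be assembled from the top results. How `RetrievalAugmentedModel` actually fills `params['retrieved_context']` is not shown in this notebook, so joining the first `k` snippets with spaces is an assumption; only the final `context;question` layout is taken from `_create_prompt` above. The helper `build_context` is hypothetical.

```python
def build_context(snippets, question_text, k):
    """Concatenate the first k snippets and append the question text after ';'."""
    retrieved_context = " ".join(snippets[:k])
    return f"{retrieved_context};{question_text}"


snippets = [
    "Address: 432 Powell Street, San Francisco CA 94102",
    "Scala's Bistro 432 Powell St, San Francisco",
    "Scala's Bistro at 432 Powell St, San Francisco, CA 94102.",
    "An unrelated fourth result",
]
question_text = "name: scala's bistro; address: 432 powell st.; phone: 415/395-8555"

# With k=3 the fourth snippet never reaches the model.
context = build_context(snippets, question_text, k=3)
print(context)
```

A larger `k` gives the model more text in which to find the answer, at the cost of more potentially misleading snippets.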

If we use retrieval augmentation and set `k=10`, our tiny model already achieves an accuracy of over 80% on the test set!

In [19]:
rag10 = RetrievalAugmentedModel(bing_websearch, minilm_ctx, k=10)

accuracy_rag_10 = score(test_questions, rag10)

f'The accuracy with retrieval augmentation and k=10 on the test set is {accuracy_rag_10}'

  0%|          | 0/432 [00:00<?, ?it/s]

'The accuracy with retrieval augmentation and k=10 on the test set is 0.8009259259259259'

## Improving the performance further with RAGBooster

We can further improve the performance of our retrieval-augmented model by learning the data importance of the retrieval sources (web domains in our case) and pruning the retrieval corpus accordingly. Check out our recent paper on **Improving Retrieval-Augmented Large Language Models with Data-Centric Refinement** for details on the algorithm behind this.

We can "boost" the performance of our model via the `RAGBooster` class and an additional set of validation questions as follows:

In [20]:
refined_rag_model = RAGBooster(rag10, validation_questions[5:])

  0%|          | 0/427 [00:00<?, ?it/s]

This "boosting" improves accuracy by more than 4 percentage points and brings us within about 5 percentage points of the accuracy achieved by the commercial LLM GPT3.5.

In [21]:
accuracy_refined = score(test_questions, refined_rag_model)
improvement = accuracy_refined - accuracy_rag_10

f'RAGBooster improved the accuracy with retrieval augmentation by {improvement:.3f}'+\
f' to {accuracy_refined}!'

  0%|          | 0/432 [00:00<?, ?it/s]

'RAGBooster improved the accuracy with retrieval augmentation by 0.044 to 0.8449074074074074!'
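
The accuracies reported in this notebook can be put side by side with a few lines of arithmetic. The snippet below uses only the numbers printed above and the test set size of 432; the helper `correct_count` is introduced here for illustration.

```python
N_TEST = 432  # number of test questions, as shown in the progress bars above

accuracy_minilm = 0.05555555555555555
accuracy_rag_10 = 0.8009259259259259
accuracy_refined = 0.8449074074074074
accuracy_gpt35 = 0.8981481481481481


def correct_count(accuracy, n=N_TEST):
    """Number of correctly answered questions implied by an accuracy value."""
    return round(accuracy * n)


counts = [correct_count(a) for a in
          (accuracy_minilm, accuracy_rag_10, accuracy_refined, accuracy_gpt35)]

# Differences expressed in percentage points.
improvement_pp = (accuracy_refined - accuracy_rag_10) * 100
gap_to_gpt35_pp = (accuracy_gpt35 - accuracy_refined) * 100

print(counts)
print(f"improvement: {improvement_pp:.1f} points, gap to GPT3.5: {gap_to_gpt35_pp:.1f} points")
```

In terms of counts, the refined model answers 365 of the 432 test questions correctly, compared to 346 before refinement and 388 for GPT3.5.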

Finally, we can inspect the most important data sources (domains) that RAGBooster identifies in our retrieval corpus:

In [23]:
domains_and_weights = refined_rag_model.weights
domains_and_weights_sorted = sorted(domains_and_weights.items(), key=lambda x:x[1], reverse=True)

domains_and_weights_sorted[:5]

[('hpi.de', 1.0),
 ('researchgate.net', 1.0),
 ('wu-wien.ac.at', 1.0),
 ('folkd.com', 0.9381223189772298),
 ('lasvegascasinos.com', 0.9231084964258172)]

Interestingly, the website of the Hasso Plattner Institute is among the top sources. It happens to host a dirty version of the actual restaurants data:

https://hpi.de/naumann/projects/repeatability/datasets/restaurants-dataset.html
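
To make the idea of pruning the retrieval corpus by source more concrete, the sketch below filters retrieved `(snippet, url)` pairs using per-domain weights. It is only an illustration and not the algorithm from the paper or the internals of `RAGBooster`: the threshold of 0.5, the default weight for unknown domains, the helper names and the low-weight `example.org` entry are all assumptions. The domain extraction is also simplified and only strips a leading `www.`.

```python
from urllib.parse import urlparse


def source_domain(url):
    """Hostname of the URL without a leading 'www.' (a simplification)."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def prune_by_weight(retrieved, weights, threshold=0.5, default_weight=0.0):
    """Keep (snippet, url) pairs whose source domain has weight >= threshold."""
    return [
        (snippet, url) for snippet, url in retrieved
        if weights.get(source_domain(url), default_weight) >= threshold
    ]


# 'hpi.de' and 'folkd.com' use weights from the output above;
# the 'example.org' entry is made up to show a pruned source.
weights = {"hpi.de": 1.0, "folkd.com": 0.9381223189772298, "example.org": 0.1}

retrieved_example = [
    ("restaurants dataset",
     "https://hpi.de/naumann/projects/repeatability/datasets/restaurants-dataset.html"),
    ("a low-weight listing", "https://www.example.org/listing"),
]

kept = prune_by_weight(retrieved_example, weights)
print([source_domain(url) for _snippet, url in kept])
```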