# Retrieval Augmented Generative Question Answering with Pinecone

#### Fixing LLMs that Hallucinate

In this notebook we will learn how to retrieve contexts relevant to our queries from Pinecone and pass them to a generative OpenAI model to generate an answer backed by real data sources.

A common problem with using GPT-3 to answer factual questions is that GPT-3 can sometimes make things up. The GPT models have a broad range of general knowledge, but this does not necessarily extend to more specific information. For that we use the Pinecone vector database as our _"external knowledge base"_, like *long-term memory* for GPT-3.

Required installs for this notebook are:

In [1]:
!pip install -qU openai pinecone-client datasets

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m70.3/70.3 KB[0m [31m2.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m177.2/177.2 KB[0m [31m7.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m468.7/468.7 KB[0m [31m19.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m10.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m58.3/58.3 KB[0m [31m5.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m283.7/283.7 KB[0m [31m26.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m212.2/212.2 KB[0m [31m21.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m132.9/132.9 KB[0m [31m14.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━

In [3]:
import openai

# get API key from top-right dropdown on OpenAI website
openai.api_key = "YOUR_OPENAI_API_KEY"  # never commit a real key to a shared notebook
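
Pasting a key directly into a cell makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key has been exported beforehand in an environment variable; the `load_api_key` helper and the `OPENAI_API_KEY` variable name are assumptions made for illustration, not part of the original notebook:

```python
import os


def load_api_key(var_name="OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable.

    Hypothetical helper: both the function and the default variable
    name are illustrative assumptions.
    """
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key
```

With this in place, the assignment could read `openai.api_key = load_api_key()`, and the key itself never appears in the saved notebook.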

For many questions *state-of-the-art (SOTA)* LLMs are more than capable of answering correctly.

In [4]:
query = "who was the 12th person on the moon and when did they land?"

# now query text-davinci-003 WITHOUT context
res = openai.Completion.create(
    engine='text-davinci-003',
    prompt=query,
    temperature=0,
    max_tokens=400,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    stop=None
)

res['choices'][0]['text'].strip()

'The 12th person on the moon was Harrison Schmitt, and he landed on December 11, 1972.'

However, that isn't always the case. First, let's rewrite the above into a simple function so we're not rewriting this every time.

In [5]:
def complete(prompt):
    # query text-davinci-003
    res = openai.Completion.create(
        engine='text-davinci-003',
        prompt=prompt,
        temperature=0,
        max_tokens=400,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None
    )
    return res['choices'][0]['text'].strip()

Now let's ask a more specific question about training a type of transformer model called a *sentence transformer*. The ideal answer we'd be looking for is _"Multiple Negatives Ranking (MNR) loss"_.

Don't worry if this is a new term to you; it isn't required to understand what we're doing or demoing here.

In [43]:
query = (
    "What support can i use for infrastructure in developing nations? " )

complete(query)

'There are a variety of support options available for infrastructure development in developing nations. These include: \n\n1. International Development Banks: International development banks such as the World Bank, the African Development Bank, and the Asian Development Bank provide loans and grants to support infrastructure projects in developing countries. \n\n2. Private Sector Investment: Private sector investment can be used to finance infrastructure projects in developing countries. This can include direct investment from companies, venture capital, and private equity. \n\n3. Government Support: Governments in developing countries can provide support for infrastructure projects through subsidies, tax incentives, and other forms of financial assistance. \n\n4. International Donors: International donors such as the United Nations, the European Union, and the United States can provide grants and other forms of financial assistance to support infrastructure projects in developing coun

One of the common answers we get to this is:

```
The best training method to use for fine-tuning a pre-trained model with sentence transformers is the Masked Language Model (MLM) training. MLM training involves randomly masking some of the words in a sentence and then training the model to predict the masked words. This helps the model to learn the context of the sentence and better understand the relationships between words.
```

This answer seems pretty convincing, right? Yet, it's wrong. MLM is typically used in the pretraining step of a transformer model but *cannot* be used to fine-tune a sentence-transformer, and has nothing to do with having _"pairs of related sentences"_.

An alternative answer we receive (and the one we returned above) is about `supervised learning approach` being the most suitable. This is completely true, but it's not specific and doesn't answer the question.

We have two options for enabling our LLM to understand and correctly answer this question:

1. We fine-tune the LLM on text data covering the topic mentioned, likely on articles and papers talking about sentence transformers, semantic search training methods, etc.

2. We use **R**etrieval **A**ugmented **G**eneration (RAG), a technique that adds an information retrieval component to the generation process, allowing us to retrieve relevant information and feed it into the generation model as a *secondary* source of information.

We will demonstrate option **2**.

---

## Building a Knowledge Base

With option **2**, retrieving relevant information requires an external _"Knowledge Base"_, a place where we can store information and efficiently retrieve it. We can think of this as the external _long-term memory_ of our LLM.

We will need to retrieve information that is semantically related to our queries. To do this we need to use _"dense vector embeddings"_. These can be thought of as numerical representations of the *meaning* behind our sentences.
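
To make "semantically related" concrete: two embeddings are usually compared by the cosine of the angle between them, with values near 1 meaning similar direction. A minimal sketch, assuming made-up three-dimensional vectors in place of real embeddings; the `cosine_similarity` helper and all the numbers are illustrative assumptions:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" with invented values; real embeddings have far more dimensions
timber_roofing = [0.9, 0.1, 0.0]
pole_construction = [0.8, 0.2, 0.1]
moon_landing = [0.0, 0.1, 0.9]

related = cosine_similarity(timber_roofing, pole_construction)
unrelated = cosine_similarity(timber_roofing, moon_landing)
print(related > unrelated)
```

The two construction-themed vectors point in nearly the same direction, so they score higher than the unrelated pair.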

To create these dense vectors we use the `text-embedding-ada-002` model.

We have already authenticated our OpenAI connection, so to create an embedding we just do:

In [27]:
embed_model = "text-embedding-ada-002"

res = openai.Embedding.create(
    input=[
        "Pole timbers provide an inexpensive source of structural timber in many developing countries and are widely used for traditional buildings." ,
        "The use of round timber has considerable potential in comparison to the use of sawn timber because of its higher structural strength and its low material cost. There are, however, problems associated with working with non-uniform sections and also with jointing. The traditional methods ofjointing using sisal rope or strips of bark in rural areas do not permit the full strength of the poles to be utilised. Improved low cost methods of connecting poles could lead to stronger structures and more economical use of materials. This paper reviews aspects relating to the use of round timber and describes the design and fabrication of some low-cost, yet high-quality structural systems suitable for roof structures of modern buildings. Keywords: Round timber, Roof structures, Connections, Jointing methods",
        
    ], engine=embed_model
)

In the response `res` we will find a JSON-like object containing our new embeddings within the `'data'` field.

In [28]:
res.keys()

dict_keys(['object', 'data', 'model', 'usage'])

Inside `'data'` we will find two records, one for each of the two texts we just embedded. Each vector embedding contains `1536` dimensions (the output dimensionality of the `text-embedding-ada-002` model).

In [29]:
len(res['data'])

2

In [30]:
len(res['data'][0]['embedding']), len(res['data'][1]['embedding'])

(1536, 1536)

We will apply this same embedding logic to a dataset containing information relevant to our query (and many other queries on the topics of ML and AI).

### Data Preparation

The dataset we will be using is the `jamescalam/youtube-transcriptions` from Hugging Face _Datasets_. It contains transcribed audio from several ML and tech YouTube channels. We download it with:

In [11]:
from datasets import load_dataset

data = load_dataset('jamescalam/youtube-transcriptions', split='train')
data

Dataset({
    features: ['title', 'published', 'url', 'video_id', 'channel_id', 'id', 'text', 'start', 'end'],
    num_rows: 208619
})

In [31]:
data[0]

{'title': 'Training and Testing an Italian BERT - Transformers From Scratch #4',
 'published': '2021-07-06 13:00:03 UTC',
 'url': 'https://youtu.be/35Pdoyi6ZoQ',
 'video_id': '35Pdoyi6ZoQ',
 'channel_id': 'UCv83tO5cePwHMt1952IVVHw',
 'id': '35Pdoyi6ZoQ-t0.0',
 'text': 'Hi, welcome to the video.',
 'start': 0.0,
 'end': 9.36}

The dataset contains many small snippets of text data. We will need to merge many snippets from each video to create more substantial chunks of text that contain more information.
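
One way to merge snippets is a sliding window: join a fixed number of consecutive snippets, then step forward by a smaller stride so that neighbouring chunks overlap. A minimal sketch, assuming all snippets belong to a single video and each is a dict with `id` and `text` keys; the `merge_snippets` helper and the toy data are hypothetical:

```python
def merge_snippets(snippets, window=20, stride=4):
    """Join consecutive snippets into overlapping chunks.

    Each chunk covers up to `window` snippets and starts `stride`
    snippets after the previous one.
    """
    chunks = []
    for start in range(0, len(snippets), stride):
        batch = snippets[start:start + window]
        chunks.append({
            'id': batch[0]['id'],  # reuse the id of the first snippet
            'text': ' '.join(s['text'] for s in batch),
        })
    return chunks


# Toy snippets standing in for one video's transcript (invented data)
toy = [{'id': f'vid-t{i}', 'text': f'sentence {i}.'} for i in range(6)]
merged = merge_snippets(toy, window=3, stride=2)
print(merged[0]['text'])
```

On the real dataset the snippets would first need to be grouped by video so that a chunk never mixes text from two different videos.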

In [38]:
from tqdm.auto import tqdm

new_data = []

window = 20  # number of sentences to combine
stride = 4  # number of sentences to 'stride' over, used to create overlap


new_data.append({
    'id' : "1",
    'text' : "Pole timbers provide an inexpensive source of structural timber in many developing countries and are widely used for traditional buildings. The use of round timber has considerable potential in comparison to the use of sawn timber because of its higher structural strength and its low material cost. There are, however, problems associated with working with non-uniform sections and also with jointing. The traditional methods ofjointing using sisal rope or strips of bark in rural areas do not permit the full strength of the poles to be utilised. Improved low cost methods of connecting poles could lead to stronger structures and more economical use of materials. This paper reviews aspects relating to the use of round timber and describes the design and fabrication of some low-cost, yet high-quality structural systems suitable for roof structures of modern buildings. Keywords: Round timber, Roof structures, Connections, Jointing methods"
})

In [39]:
new_data[0]

{'id': '1',
 'text': 'Pole timbers provide an inexpensive source of structural timber in many developing countries and are widely used for traditional buildings. The use of round timber has considerable potential in comparison to the use of sawn timber because of its higher structural strength and its low material cost. There are, however, problems associated with working with non-uniform sections and also with jointing. The traditional methods ofjointing using sisal rope or strips of bark in rural areas do not permit the full strength of the poles to be utilised. Improved low cost methods of connecting poles could lead to stronger structures and more economical use of materials. This paper reviews aspects relating to the use of round timber and describes the design and fabrication of some low-cost, yet high-quality structural systems suitable for roof structures of modern buildings. Keywords: Round timber, Roof structures, Connections, Jointing methods'}

Now we need a place to store these embeddings and enable an efficient _vector search_ through them all. To do that we use Pinecone. We can get a [free API key](https://app.pinecone.io) and enter it below, where we will initialize our connection to Pinecone and create a new index.

In [35]:
import pinecone

index_name = 'openai-youtube-transcriptions'

# initialize connection to pinecone (get API key at app.pinecone.io)
pinecone.init(
    api_key="YOUR_PINECONE_API_KEY",
    environment="us-west4-gcp"  # may be different, check at app.pinecone.io
)

# check if index already exists (it shouldn't if this is first time)
if index_name not in pinecone.list_indexes():
    # if does not exist, create index
    pinecone.create_index(
        index_name,
        dimension=len(res['data'][0]['embedding']),
        metric='cosine',
        
    )
# connect to index
index = pinecone.Index(index_name)
# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

We can see the index is currently empty with a `total_vector_count` of `0`. We can begin populating it with OpenAI `text-embedding-ada-002` built embeddings like so:

In [42]:
from tqdm.auto import tqdm
import datetime
from time import sleep

batch_size = 100  # how many embeddings we create and insert at once

for i in tqdm(range(0, len(new_data), batch_size)):
    # find end of batch
    i_end = min(len(new_data), i+batch_size)
    meta_batch = new_data[i:i_end]
    # get ids
    ids_batch = [x['id'] for x in meta_batch]
    # get texts to encode
    texts = [x['text'] for x in meta_batch]
    # create embeddings, pausing and retrying whenever we hit a RateLimitError
    while True:
        try:
            res = openai.Embedding.create(input=texts, engine=embed_model)
            break
        except openai.error.RateLimitError:
            sleep(5)
    embeds = [record['embedding'] for record in res['data']]
    # cleanup metadata
    meta_batch = [{'text': x['text']} for x in meta_batch]
    to_upsert = list(zip(ids_batch, embeds, meta_batch))
    # upsert to Pinecone
    index.upsert(vectors=to_upsert)

  0%|          | 0/1 [00:00<?, ?it/s]
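
The retry logic in the loop above waits a fixed five seconds between attempts. A common variation is exponential backoff with a cap on the number of attempts. A minimal sketch, in which the `with_retries` helper, its parameters, and its defaults are all assumptions for illustration:

```python
import time


def with_retries(fn, max_attempts=5, base_delay=1.0,
                 retry_on=(Exception,), sleep=time.sleep):
    """Call fn(), retrying with exponentially growing delays on failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # wait base_delay, then 2x, 4x, ... before trying again
            sleep(base_delay * 2 ** attempt)
```

In this notebook it might wrap the embedding call as `with_retries(lambda: openai.Embedding.create(input=texts, engine=embed_model), retry_on=(openai.error.RateLimitError,))`, assuming the pre-1.0 `openai` client used throughout.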

Now we search. For this we need to create a _query vector_ `xq`:

In [44]:
res = openai.Embedding.create(
    input=[query],
    engine=embed_model
)

# retrieve from Pinecone
xq = res['data'][0]['embedding']

# get relevant contexts (including the questions)
res = index.query(xq, top_k=1, include_metadata=True)

In [45]:
res

{'matches': [{'id': '1',
              'metadata': {'text': 'Pole timbers provide an inexpensive source '
                                   'of structural timber in many developing '
                                   'countries and are widely used for '
                                   'traditional buildings. The use of round '
                                   'timber has considerable potential in '
                                   'comparison to the use of sawn timber '
                                   'because of its higher structural strength '
                                   'and its low material cost. There are, '
                                   'however, problems associated with working '
                                   'with non-uniform sections and also with '
                                   'jointing. The traditional methods '
                                   'ofjointing using sisal rope or strips of '
                                   'bark in rural a
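
Conceptually, a `top_k` query against a cosine-metric index ranks the stored vectors by their similarity to the query vector and returns the best few. A brute-force sketch of that idea, assuming invented two-dimensional vectors; the `top_k_matches` helper is hypothetical and says nothing about how Pinecone implements search internally:

```python
import math


def top_k_matches(query_vec, records, top_k=1):
    """Rank (id, vector) records by cosine similarity to query_vec."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))

    scored = [(rec_id, cosine(query_vec, vec)) for rec_id, vec in records]
    # highest similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Invented records: ids paired with toy vectors
records = [('1', [1.0, 0.0]), ('2', [0.0, 1.0]), ('3', [0.7, 0.7])]
matches = top_k_matches([0.9, 0.1], records, top_k=2)
print([rec_id for rec_id, _ in matches])
```

Scanning every record like this scales linearly with the size of the collection, which is the cost a vector database is designed to avoid.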

In [52]:
limit = 3750

def retrieve(query):
    res = openai.Embedding.create(
        input=[query],
        engine=embed_model
    )

    # retrieve from Pinecone
    xq = res['data'][0]['embedding']

    # get relevant contexts
    res = index.query(xq, top_k=1, include_metadata=True)
    contexts = [
        x['metadata']['text'] for x in res['matches']
    ]

    # build our prompt with the retrieved contexts included
    prompt_start = (
        "Answer the question based on the context below.\n\n"+
        "Context:\n"
    )
    prompt_end = (
        f"\n\nQuestion: {query}\nAnswer:"
    )
    # append contexts while the joined text stays under the character limit
    selected = []
    for context in contexts:
        if len("\n\n---\n\n".join(selected + [context])) >= limit:
            break
        selected.append(context)
    prompt = (
        prompt_start +
        "\n\n---\n\n".join(selected) +
        prompt_end
    )
    return prompt

In [53]:
# first we retrieve relevant items from Pinecone
query_with_contexts = retrieve(query)
query_with_contexts

'Answer the question based on the context below.\n\nContext:\nPole timbers provide an inexpensive source of structural timber in many developing countries and are widely used for traditional buildings. The use of round timber has considerable potential in comparison to the use of sawn timber because of its higher structural strength and its low material cost. There are, however, problems associated with working with non-uniform sections and also with jointing. The traditional methods ofjointing using sisal rope or strips of bark in rural areas do not permit the full strength of the poles to be utilised. Improved low cost methods of connecting poles could lead to stronger structures and more economical use of materials. This paper reviews aspects relating to the use of round timber and describes the design and fabrication of some low-cost, yet high-quality structural systems suitable for roof structures of modern buildings. Keywords: Round timber, Roof structures, Connections, Jointing 

In [54]:
# then we complete the context-infused query
complete(query_with_contexts)

'Pole timbers provide an inexpensive source of structural timber in many developing countries and are widely used for traditional buildings. Improved low cost methods of connecting poles could lead to stronger structures and more economical use of materials.'

And we get an answer grounded in the retrieved context straight away: the response draws directly on the pole-timber passage we stored in Pinecone.