<a href="https://colab.research.google.com/github/fredsiika/gpt-vector-agent/blob/main/getting_started/retrieval_augmentation_langchain_pinecone_openai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Solving LLM Hallucination with Knowledge Bases

- **Large Language Models (LLMs)** have a data freshness problem. Even some of the most powerful models, such as OpenAI's `gpt-3.5-turbo` and `GPT-4`, know nothing about recent events.

- To an LLM, the world is frozen in time: it knows the world only as it appeared in its training data. For example, ChatGPT's knowledge cutoff is September 2021.

This creates problems for any use case that relies on up-to-date information or on a particular dataset. For example, you might want to query internal company documents through a Large Language Model.

The first challenge is getting those documents into the Large Language Model. Training the model on them is time-consuming and expensive, and retraining for every new document is impractical.

### So, how do we handle this problem? 

We can use **retrieval augmentation**. This technique allows us to retrieve relevant information from an external knowledge base and give that information to our Large Language Model.

The external knowledge base serves as our "window" into the world beyond the Large Language Model's training data. This Colab notebook is my attempt at learning all about **implementing retrieval augmentation for Large Language Models** using [LangChain](https://python.langchain.com/en/latest/) and the [Pinecone vector database](https://www.pinecone.io/).
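
To make the idea concrete before any libraries are involved, here is a minimal sketch of retrieval augmentation in plain Python. It is a toy under stated assumptions, not what this notebook builds: simple word overlap stands in for embedding similarity, a Python list stands in for the vector database, and the helper names (`words`, `retrieve`, `build_augmented_prompt`) and the sample documents are hypothetical.

```python
import re


def words(text):
    """Return the lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, knowledge_base, top_k=2):
    """Return the top_k documents that share the most words with the query."""
    query_words = words(query)
    ranked = sorted(
        knowledge_base,
        key=lambda doc: len(query_words & words(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_augmented_prompt(query, knowledge_base, top_k=2):
    """Place the retrieved documents in the prompt, ahead of the question."""
    contexts = retrieve(query, knowledge_base, top_k)
    context_block = "\n".join(f"- {doc}" for doc in contexts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {query}\nAnswer:"
    )


# Hypothetical internal documents standing in for a real knowledge base.
knowledge_base = [
    "The office is closed on public holidays.",
    "Expense reports are due on the fifth of each month.",
    "The VPN requires two-factor authentication.",
]

print(build_augmented_prompt("When are expense reports due?", knowledge_base, top_k=1))
```

The resulting prompt is what would be sent to the completion model. In a real pipeline the retrieval step would use embeddings and a vector index rather than keyword overlap.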



In [None]:
!pip install -qU langchain openai tiktoken pinecone-client[grpc] datasets

## Acquiring Data for our Knowledge Base

To provide our LLM with pertinent source knowledge, we must create a knowledge base. This begins with selecting a dataset that aligns with our intended use case. Depending on the application, the dataset may vary, such as code documentation for an LLM designed to assist with coding or company documents for an internal chatbot. 

The choice of dataset is critical as it directly impacts the quality of the knowledge base we are building for our LLM.
For our example, we will be using a subset of Wikipedia, which we can obtain through [Hugging Face datasets](https://huggingface.co/datasets/wikipedia).

<details>
    <summary>
    Click here for Dataset Summary
    </summary>
    Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).
    
    You can find the full list of languages and dates [here](https://dumps.wikimedia.org/backup-index.html).

    The articles are parsed using the `mwparserfromhell` tool. 
</details>

To load this dataset you need to install Apache Beam and `mwparserfromhell` first:

In [2]:
!pip install apache_beam mwparserfromhell

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting apache_beam
  Downloading apache_beam-2.48.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.3 MB)
Collecting mwparserfromhell
  Downloading mwparserfromhell-0.6.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (190 kB)
Collecting crcmod<2.0,>=1.7 (from apache_beam)
  Downloading crcmod-1.7.tar.gz (89 kB)
  Preparing metadata (setup.py) ... done
Collecting orjson<4.0 (from apache_beam)
  Downloading orjson-3.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64


## Troubleshoot Installation Issues

If the two previous steps fail with errors related to the `protobuf` library, try running the next two code blocks.

Reach out to me on [GitHub](https://github.com/fredsiika/gpt-vector-agent) if you're still having issues.

In [3]:
import threading 


In [4]:
!pip3 install --upgrade google-auth
!pip3 install protobuf==3.19.6

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting google-auth
  Downloading google_auth-2.19.1-py2.py3-none-any.whl (181 kB)
Installing collected packages: google-auth
  Attempting uninstall: google-auth
    Found existing installation: google-auth 2.17.3
    Uninstalling google-auth-2.17.3:
      Successfully uninstalled google-auth-2.17.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires google-auth==2.17.3, but you have google-auth 2.19.1 which is incompatible.
Successfully installed google-auth-2.19.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting protobuf==3.19.6
  Downloading protobuf-3.19.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
Installing collected packages: protobuf
  Attempting uninstall: protobuf
    Found existing installation: protobuf 4.23.2
    Uninstalling protobuf-4.23.2:
      Successfully uninstalled protobuf-4.23.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
apache-beam 2.48.0 requires protobuf<4.24.0,>=3.20.3, but you have protobuf 3.19.6 which is incompatible.
tensorflow 2.12.0 requires protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3, but you have protobuf 3.19.6 which is incompatible.


## Load the dataset

Some subsets of Wikipedia have already been processed by HuggingFace, which is what we'll be using. You can load them just with:

In [5]:
from datasets import load_dataset

data = load_dataset("wikipedia", "20220301.simple", split='train[:10000]')

Downloading builder script:   0%|          | 0.00/35.9k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/30.4k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/16.3k [00:00<?, ?B/s]

Downloading and preparing dataset wikipedia/20220301.simple to /root/.cache/huggingface/datasets/wikipedia/20220301.simple/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559...


Downloading:   0%|          | 0.00/1.66k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/235M [00:00<?, ?B/s]

ImportError: ignored

In [None]:
data

Most datasets contain records with a lot of text. Because of this, our first task is usually to build a preprocessing pipeline that splits those long texts into smaller chunks.

## Creating Chunks

Splitting our text into smaller chunks is essential for the following reasons:

- Improve “embedding accuracy”, which improves the relevance of results later.
- Reduce the amount of text fed into our LLM as source knowledge. Limiting input improves the LLM’s ability to follow instructions, reduces generation costs, and helps us get faster responses.
- Provide users with more precise information sources, as we can narrow down the information source to a smaller chunk of text.
- In the case of _very long_ chunks of text, we will exceed the maximum context window of our embedding or completion models. Splitting these chunks makes it possible to add these longer documents to our knowledge base.

To create these chunks, we first need a way of measuring the length of our text. LLMs don’t measure text by word or character — they measure it by “tokens”.

A token is typically the size of a word or sub-word and varies by LLM. The tokens themselves are built using a _tokenizer_. We will be using `gpt-3.5-turbo` as our completion model, and we can initialize the tokenizer for this model like so:



In [None]:
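import tiktoken  # !pip install tiktoken

# gpt-3.5-turbo uses the cl100k_base encoding
# (p50k_base belongs to older models such as text-davinci-003)
tokenizer = tiktoken.get_encoding('cl100k_base')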

Using the tokenizer, we can create tokens from plain text and count the number of tokens. We will wrap this into a function called `tiktoken_len`:

In [None]:
# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

tiktoken_len("hello I am a chunk of text and using the tiktoken_len function "
             "we can find the length of this chunk of text in tokens")

With our token counting function ready, we can initialize a LangChain `RecursiveCharacterTextSplitter` object. This object will allow us to split our text into chunks no longer than what we specify via the `chunk_size` parameter.



In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=20,
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""]
)
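
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter merges separator-delimited pieces). It slides a fixed-size window over a list of tokens instead, and the function name `window_chunks` is hypothetical.

```python
def window_chunks(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Integers stand in for token ids here.
for chunk in window_chunks(list(range(10)), chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window repeats the last token of the previous one, so a sentence that straddles a boundary is less likely to lose its context entirely.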

In [None]:
chunks = text_splitter.split_text(data[6]['text'])[:3]
chunks

None of these chunks exceeds the `400`-token chunk size limit we set earlier.
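
The size guarantee comes from trying coarse separators first and falling back to finer ones only for pieces that are still too long. The following is a minimal sketch of that idea in plain Python, not LangChain's implementation. It assumes a whitespace word count (`word_len`) as a stand-in for `tiktoken_len`, ignores chunk overlap, and uses the hypothetical name `split_with_limit`.

```python
def word_len(text):
    """Stand-in length function: counts whitespace-separated words, not real tokens."""
    return len(text.split())


def split_with_limit(text, chunk_size, length_function=word_len,
                     separators=("\n\n", "\n", " ")):
    """Split on the coarsest separator first; recurse on pieces that are still too long."""
    if length_function(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: halve the text by characters as a last resort.
        if len(text) <= 1:
            return [text]
        mid = len(text) // 2
        return (split_with_limit(text[:mid], chunk_size, length_function, ())
                + split_with_limit(text[mid:], chunk_size, length_function, ()))
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if length_function(candidate) <= chunk_size:
            current = candidate  # the piece still fits in the current chunk
            continue
        if current.strip():
            chunks.append(current)
        current = ""
        if length_function(piece) <= chunk_size:
            current = piece  # start a new chunk with this piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(split_with_limit(piece, chunk_size, length_function, finer))
    if current.strip():
        chunks.append(current)
    return chunks


# Three paragraphs of 5, 30, and 12 words, with a limit of 10 words per chunk.
text = "\n\n".join(" ".join(f"w{i}" for i in range(n)) for n in (5, 30, 12))
chunks = split_with_limit(text, chunk_size=10)
print([word_len(chunk) for chunk in chunks])
```

The short paragraph survives as a single chunk, while the longer ones are broken at spaces because no paragraph or line break is available inside them.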

With the `text_splitter`, we get nicely-sized chunks of text. We’ll use this functionality during the indexing process later. For now, let’s take a look at _embeddings_.

