LLMs are frozen in time: they only know what was in their training data. Retraining on the latest documents would keep them current, but it is expensive and time-consuming.

The fix is **retrieval augmentation**: at query time, relevant information is retrieved from an external source and passed to the model alongside the question.
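
As a rough illustration of the idea, the sketch below picks the most relevant passage for a question and places it in the prompt. The `retrieve` and `build_prompt` helpers, the word-overlap scoring, and the sample documents are all hypothetical; a real system would typically rank passages by embedding similarity instead.

```python
# Toy retrieval augmentation: pick the passage that shares the most words
# with the question, then place it in the prompt as context.
# All names and documents here are illustrative assumptions.

def retrieve(question, documents, top_k=1):
    """Rank documents by how many lowercase words they share with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, contexts):
    """Place the retrieved passages ahead of the question."""
    context_block = "\n".join(contexts)
    return (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )


documents = [
    "Alan Turing was born in Maida Vale, London on 23 June 1912.",
    "Pinecone is a vector database.",
]
question = "Where was Alan Turing born?"
contexts = retrieve(question, documents)
prompt = build_prompt(question, contexts)
print(prompt)
```

The model never has to be retrained: only the text placed in the prompt changes.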

## Creating the Knowledge Base
There are two primary types of knowledge for LLMs:
1. parametric knowledge - everything learned during training
2. source knowledge - additional sources that the model can refer to

### Getting Data for our Knowledge Base
We will use a subset of Wikipedia in this example. It will be retrieved via Hugging Face datasets.

In [53]:
# install dependencies
!pip3 install datasets huggingface tiktoken pinecone-client

Collecting pinecone-client
  Downloading pinecone_client-2.2.2-py3-none-any.whl (179 kB)
Collecting loguru>=0.5.0 (from pinecone-client)
  Downloading loguru-0.7.0-py3-none-any.whl (59 kB)
Collecting dnspython>=2.0.0 (from pinecone-client)
  Downloading dnspython-2.3.0-py3-none-any.whl (283 kB)
Installing collected packages: loguru, dnspython, pinecone-client
Successfully installed dnspython-2.3.0 loguru-0.7.0 pinecone-client-2.2.2


In [4]:
import json

In [29]:
# hit a snag with Hugging Face's datasets module, so the data was
# downloaded locally as JSON; keep the first 10,000 rows
with open('wikipedia', 'r', encoding='utf-8') as file:
    data = json.load(file)['rows'][:10000]
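
The loading code above relies on a particular nesting in the file. The sketch below uses a made-up two-row sample (not the real data) to show that shape: a top-level `rows` list whose items hold the article text under `item['row']['text']`.

```python
import json

# Hypothetical two-row sample with the same nesting the loading code relies on.
sample = json.loads("""
{"rows": [
  {"row": {"text": "First article text."}},
  {"row": {"text": "Second article text."}}
]}
""")

rows = sample["rows"][:10000]  # slicing past the end of a list is safe
texts = [item["row"]["text"] for item in rows]
print(len(texts), texts[0])
```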

In [33]:
print(data[6]['row']['text'])

Alan Mathison Turing OBE FRS (London, 23 June 1912 – Wilmslow, Cheshire, 7 June 1954) was an English mathematician and computer scientist. He was born in Maida Vale, London.

Early life and family 
Alan Turing was born in Maida Vale, London on 23 June 1912. His father was part of a family of merchants from Scotland. His mother, Ethel Sara, was the daughter of an engineer.

Education 
Turing went to St. Michael's, a school at 20 Charles Road, St Leonards-on-sea, when he was five years old.
"This is only a foretaste of what is to come, and only the shadow of what is going to be.” – Alan Turing.

The Stoney family were once prominent landlords, here in North Tipperary. His mother Ethel Sara Stoney (1881–1976) was daughter of Edward Waller Stoney (Borrisokane, North Tipperary) and Sarah Crawford (Cartron Abbey, Co. Longford); Protestant Anglo-Irish gentry.

Educated in Dublin at Alexandra School and College; on October 1st 1907 she married Julius Mathison Turing, latter son of Reverend Joh

### Creating Chunks
The text must be split into smaller chunks to be useful. The primary objectives are to:
1. improve embedding accuracy
2. reduce the amount of text fed into the LLM
3. stay within the model's maximum context window, which very long texts can exceed

Chunk length has to be measured in tokens (words or sub-words, depending on the model), because model limits are counted in tokens rather than characters.
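
To make the idea of a token budget concrete, here is a minimal sketch of a fixed-window chunker. It is a simplification, not the splitter used in this notebook: whitespace-separated words stand in for tokens, and the `chunk_by_tokens` helper is a hypothetical name.

```python
# Simplified fixed-window chunker. Whitespace-separated words stand in for
# tokens here (an assumption); a real tokenizer would split differently.

def chunk_by_tokens(text, chunk_size, chunk_overlap=0):
    """Split text into windows of at most chunk_size tokens.

    Consecutive windows share chunk_overlap tokens, so text near a
    boundary keeps some of its surrounding context.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
chunks = chunk_by_tokens(sample, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window holds at most four "tokens", and the last token of one window is repeated as the first token of the next.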

In [21]:
import tiktoken

In [23]:
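# p50k_base is the encoding used for the counts below; text-embedding-ada-002
# itself uses cl100k_base, so the counts are approximate for that model
tokenizer = tiktoken.get_encoding('p50k_base')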

def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

tiktoken_len("hello I am a chunk of text and using the tiktoken_len function "
"we can find the length of this chunk of text in tokens")

28

In [24]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
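text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,       # maximum chunk length, as measured by length_function
    chunk_overlap=20,     # overlap between neighbouring chunks, same unit
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""],  # tried in order, coarsest first
)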

In [34]:
# split the text
chunks = text_splitter.split_text(data[6]['row']['text'])[:3]
chunks

['Alan Mathison Turing OBE FRS (London, 23 June 1912 – Wilmslow, Cheshire, 7 June 1954) was an English mathematician and computer scientist. He was born in Maida Vale, London.\n\nEarly life and family \nAlan Turing was born in Maida Vale, London on 23 June 1912. His father was part of a family of merchants from Scotland. His mother, Ethel Sara, was the daughter of an engineer.\n\nEducation \nTuring went to St. Michael\'s, a school at 20 Charles Road, St Leonards-on-sea, when he was five years old.\n"This is only a foretaste of what is to come, and only the shadow of what is going to be.” – Alan Turing.\n\nThe Stoney family were once prominent landlords, here in North Tipperary. His mother Ethel Sara Stoney (1881–1976) was daughter of Edward Waller Stoney (Borrisokane, North Tipperary) and Sarah Crawford (Cartron Abbey, Co. Longford); Protestant Anglo-Irish gentry.\n\nEducated in Dublin at Alexandra School and College; on October 1st 1907 she married Julius Mathison Turing, latter son o

In [38]:
tiktoken_len(chunks[0]), tiktoken_len(chunks[1]), tiktoken_len(chunks[2])

(397, 304, 370)

### Creating Embeddings
Each chunk of text is encoded as a vector embedding: a numerical representation of the text.

The embeddings are stored in a vector database, where the distances between them can be calculated to find the chunks most similar to a query.
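
As a small illustration of comparing embeddings, the sketch below computes the dot product and cosine similarity of a few vectors. The three-dimensional vectors are made up for readability; real embeddings have many more dimensions.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up vectors: two pointing in similar directions, one pointing elsewhere.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.0]
car = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten))  # close to 1: similar direction
print(cosine_similarity(cat, car))     # close to 0: nearly orthogonal
```

For vectors of length 1, the dot product and the cosine similarity are the same number.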

We will use an embedding model called **text-embedding-ada-002**.

In [65]:
from langchain.embeddings.openai import OpenAIEmbeddings
from dotenv import load_dotenv
import os
load_dotenv()

model_name = 'text-embedding-ada-002'

embed = OpenAIEmbeddings(model=model_name)

In [66]:
texts = [
    'this is the first chunk of text',
    'then another second chunk of text'
]

res = embed.embed_documents(texts)
len(res), len(res[0])

(2, 1536)

### Vector Database
We will use the Pinecone vector database.
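
To show what a vector database does conceptually, here is a toy in-memory stand-in. The `ToyVectorIndex` class, its method names, and the sample vectors are hypothetical and are not Pinecone's API; it only illustrates storing vectors with metadata and ranking them by dot product against a query.

```python
class ToyVectorIndex:
    """In-memory stand-in for a vector database (illustrative only)."""

    def __init__(self, dimension):
        self.dimension = dimension
        self.records = {}  # id -> (vector, metadata)

    def upsert(self, item_id, vector, metadata=None):
        """Insert a vector, or overwrite the one stored under the same id."""
        if len(vector) != self.dimension:
            raise ValueError(f"expected {self.dimension} values, got {len(vector)}")
        self.records[item_id] = (list(vector), metadata or {})

    def query(self, vector, top_k=3):
        """Return the top_k (score, id, metadata) tuples by dot product."""
        scored = [
            (sum(q * v for q, v in zip(vector, stored)), item_id, metadata)
            for item_id, (stored, metadata) in self.records.items()
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:top_k]


index = ToyVectorIndex(dimension=3)
index.upsert("turing-0", [0.9, 0.1, 0.0], {"text": "Alan Turing was born in Maida Vale."})
index.upsert("other-0", [0.0, 0.2, 0.9], {"text": "An unrelated chunk."})

results = index.query([1.0, 0.0, 0.0], top_k=1)
print(results[0][1], results[0][2]["text"])
```

A real vector database adds persistence and approximate search so that queries stay fast over very large collections.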

In [73]:
import pinecone

index_name = 'langchain-retrieval-augmentation'

pinecone.init(
    api_key=os.getenv('PINECONE_API_KEY'),
    environment=os.getenv('PINECONE_ENV'),
)

# create a new index
pinecone.create_index(
    name=index_name,
    metric='dotproduct',
    dimension=len(res[0]),  # 1536, the embedding length shown above
)


# unfortunately hit a wall here (see the 401 error below) until pinecone's
# free tier server becomes responsive again. I'll pick up here again either
# on the free tier or on the paid version

UnauthorizedException: (401)
Reason: Unauthorized
HTTP response headers: HTTPHeaderDict({'www-authenticate': 'API key is missing or invalid for the environment "asia-southeast1-gcp-free". Check that the correct environment is specified.', 'content-length': '126', 'date': 'Fri, 14 Jul 2023 04:31:26 GMT', 'server': 'envoy'})
HTTP response body: API key is missing or invalid for the environment "asia-southeast1-gcp-free". Check that the correct environment is specified.
