# Retrieval Augmentation using Pinecone as the Vector Database

This is a follow-up notebook that builds a retrieval augmentation system, which combines a Large Language Model with our own data to perform question answering, summarization, and other tasks.

In [1]:
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv('../.env'))

True
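The later cells call OpenAI and Pinecone, so it can help to confirm that the `.env` file actually supplied the credentials before going further. The following is a minimal sketch, assuming the keys are named `OPENAI_API_KEY` and `PINECONE_API_KEY`; both names and the `missing_keys` helper are assumptions for illustration, not something this notebook defines.

```python
import os

# Hypothetical variable names; adjust to whatever your .env file defines.
REQUIRED_KEYS = ("OPENAI_API_KEY", "PINECONE_API_KEY")


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in `env`."""
    return [key for key in required if not env.get(key)]


missing = missing_keys(os.environ)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this right after `load_dotenv` makes a missing credential visible early, rather than as a hang or an authentication error several cells later.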

## Building the Knowledge Base

In [2]:
from datasets import load_dataset

data = load_dataset("wikipedia", "20220301.simple", split="train[:10000]")
data

Found cached dataset wikipedia (/Users/mufin/.cache/huggingface/datasets/wikipedia/20220301.simple/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559)


Dataset({
    features: ['id', 'url', 'title', 'text'],
    num_rows: 10000
})

In [3]:
data[2]

{'id': '6',
 'url': 'https://simple.wikipedia.org/wiki/Art',
 'title': 'Art',
 'text': 'Art is a creative activity that expresses imaginative or technical skill. It produces a product, an object. Art is a diverse range of human activities in creating visual, performing artifacts, and expressing the author\'s imaginative mind. The product of art is called a work of art, for others to experience.\n\nSome art is useful in a practical sense, such as a sculptured clay bowl that can be used. That kind of art is sometimes called a craft.\n\nThose who make art are called artists. They hope to affect the emotions of people who experience it. Some people find art relaxing, exciting or informative. Some say people are driven to make art due to their inner creativity.\n\n"The arts" is a much broader term. It includes drawing, painting, sculpting, photography, performance art, dance, music, poetry, prose and theatre.\n\nTypes of art \n\nArt is divided into the plastic arts, where something is made,

In [4]:
import tiktoken

tokenizer = tiktoken.encoding_for_model('gpt-3.5-turbo')

# Create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

In [5]:
tiktoken_len("Hello I am a chunk of text and using the tiktoken_len function "
             "we can find the length of this chunk of text in tokens")

26

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=20,
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""]
)

Could not import azure.core python package.


In [7]:
# An example of splitting text
chunks = text_splitter.split_text(data[2]['text'])
print("The text is split into", len(chunks), "chunks")
chunks

The text is split into 4 chunks


['Art is a creative activity that expresses imaginative or technical skill. It produces a product, an object. Art is a diverse range of human activities in creating visual, performing artifacts, and expressing the author\'s imaginative mind. The product of art is called a work of art, for others to experience.\n\nSome art is useful in a practical sense, such as a sculptured clay bowl that can be used. That kind of art is sometimes called a craft.\n\nThose who make art are called artists. They hope to affect the emotions of people who experience it. Some people find art relaxing, exciting or informative. Some say people are driven to make art due to their inner creativity.\n\n"The arts" is a much broader term. It includes drawing, painting, sculpting, photography, performance art, dance, music, poetry, prose and theatre.\n\nTypes of art \n\nArt is divided into the plastic arts, where something is made, and the performing arts, where something is done by humans in action. The other divis

In [8]:
print("Length of every chunk:", [tiktoken_len(c) for c in chunks])

Length of every chunk: [332, 369, 391, 86]
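All four chunks stay under the 400-token limit because the splitter tries the coarsest separator first and only falls back to finer ones for pieces that are still too long. The sketch below illustrates that idea in plain Python. It is a simplified stand-in, not LangChain's actual implementation: it ignores `chunk_overlap`, and the hypothetical `word_len` function counts whitespace-separated words instead of tokens so that the example runs without `tiktoken`.

```python
def word_len(text):
    # Stand-in for tiktoken_len: counts whitespace-separated words, not tokens.
    return len(text.split())


def recursive_split(text, chunk_size, separators, length_function=word_len):
    """Split text into pieces of at most chunk_size, trying coarse separators first."""
    if length_function(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: return the oversized piece unchanged.
        return [text]
    sep, finer = separators[0], separators[1:]
    parts = text.split(sep) if sep else list(text)
    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if length_function(candidate) <= chunk_size:
            # The piece still fits, so keep growing the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if length_function(part) <= chunk_size:
            current = part
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer, length_function))
    if current:
        chunks.append(current)
    return chunks


sample = "a b c\n\nd e f g h\n\ni"
pieces = recursive_split(sample, chunk_size=3, separators=["\n\n", "\n", " "])
print(pieces)
```

With a limit of three words, the middle paragraph is too long for the `"\n\n"` separator alone, so it is split again on spaces while the short paragraphs are kept whole.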


## Create Embeddings

In [9]:
from langchain.embeddings.openai import OpenAIEmbeddings

model_name = 'text-embedding-ada-002'
embed = OpenAIEmbeddings(model=model_name)

In [10]:
texts = [
    'this is the first chunk of text',
    'then another second chunk of text is here'
]

res = embed.embed_documents(texts)
len(res), len(res[0]), len(res[1])

(2, 1536, 1536)
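Each text is now a 1536-dimensional vector, and retrieval works by comparing such vectors. OpenAI documents its embeddings as normalized to length 1; under that assumption the dot product and the cosine similarity of two embeddings coincide. The sketch below checks this on made-up three-dimensional vectors (the vectors and helper names are illustrative only, not real embeddings).

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for embeddings, scaled to unit length.
a = normalize([1.0, 2.0, 2.0])
b = normalize([2.0, 1.0, 2.0])

print(dot(a, b), cosine_similarity(a, b))
```

For unit-length vectors the two printed values agree up to floating-point error, which is why a dot-product metric is a reasonable choice for normalized embeddings.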

## Create Vector Database

Sadly, I cannot connect to Pinecone at the moment, but I will try to figure out the issue here.

In [11]:
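import os

import pinecone

# pinecone.init takes an API key and the environment name shown in the
# Pinecone console (for example "us-west1-gcp"). A value such as "Wiki"
# looks like a project name and may be why the connection hangs.
# Assumption: both values are stored in ../.env under these variable names.
PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]
PINECONE_ENV_NAME = os.environ["PINECONE_ENVIRONMENT"]

index_name = 'langchain-retrieval-augmentation'
pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_ENV_NAME
)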

if index_name not in pinecone.list_indexes():
    # Create a new index
    pinecone.create_index(
        name=index_name,
        metric='dotproduct',
        dimension=len(res[0])
    )

KeyboardInterrupt: 

In [None]:
index = pinecone.GRPCIndex(index_name)
index.describe_index_stats()

## Indexing

In [None]:
from tqdm.auto import tqdm
from uuid import uuid4

batch_limit = 100

texts = []
metadatas = []

for i, record in enumerate(tqdm(data)):
    # first get metadata fields for this record
    metadata = {
        'wiki-id': str(record['id']),
        'source': record['url'],
        'title': record['title']
    }
    # now we create chunks from the record text
    record_texts = text_splitter.split_text(record['text'])
    # create individual metadata dicts for each chunk
    record_metadatas = [{
        "chunk": j, "text": text, **metadata
    } for j, text in enumerate(record_texts)]
    # append these to current batches
    texts.extend(record_texts)
    metadatas.extend(record_metadatas)
    # if we have reached the batch_limit we can add texts
    if len(texts) >= batch_limit:
        ids = [str(uuid4()) for _ in range(len(texts))]
        embeds = embed.embed_documents(texts)
        index.upsert(vectors=zip(ids, embeds, metadatas))
        texts = []
        metadatas = []

if len(texts) > 0:
    ids = [str(uuid4()) for _ in range(len(texts))]
    embeds = embed.embed_documents(texts)
    index.upsert(vectors=zip(ids, embeds, metadatas))
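The loop above buffers chunks, flushes the buffer once it reaches `batch_limit`, and performs one final flush for whatever is left. That pattern can be isolated in a small generator, shown here against a plain list rather than a real index. The `batched` helper and the `stored` list are hypothetical names for illustration. Unlike the loop above, which adds all chunks of a record at once and can therefore overshoot the limit, this sketch flushes item by item.

```python
def batched(items, batch_limit):
    """Yield consecutive lists holding at most batch_limit items each."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_limit:
            yield batch
            batch = []
    if batch:
        # Final partial batch, the counterpart of the trailing `if len(texts) > 0`.
        yield batch


stored = []  # stand-in for the vector index
for batch in batched(range(250), batch_limit=100):
    stored.extend(batch)  # a real pipeline would embed and upsert here
    print("flushed", len(batch), "items")
```

Without the final `if batch:` branch, the last 50 items in this example would never be written, which is the same reason the indexing cell needs its trailing flush.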

In [None]:
index.describe_index_stats()

## Create LangChain Vector Store and Querying

In [None]:
from langchain.vectorstores import Pinecone

text_field = "text"

# switch back to normal index for langchain
index = pinecone.Index(index_name)

vectorstore = Pinecone(
    index, embed.embed_query, text_field
)

In [None]:
query = "who was Benito Mussolini?"

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)
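Conceptually, a similarity search embeds the query and returns the `k` stored vectors that score highest against it. The sketch below shows that idea as an exhaustive scan over a tiny in-memory dictionary using the dot product. It is only an illustration under stated assumptions: the `top_k` helper, the `toy_index` data, and the two-dimensional vectors are all made up, and a production vector database would typically use an approximate index rather than scoring every vector.

```python
def top_k(query_vec, vectors, k=3):
    """Return the ids of the k stored vectors with the highest dot product."""
    scored = [
        (sum(q * x for q, x in zip(query_vec, vec)), doc_id)
        for doc_id, vec in vectors.items()
    ]
    # Highest score first.
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy "index": ids mapped to made-up 2-dimensional vectors.
toy_index = {
    "doc-a": [1.0, 0.0],
    "doc-b": [0.6, 0.8],
    "doc-c": [0.0, 1.0],
    "doc-d": [-1.0, 0.0],
}

results = top_k([1.0, 0.0], toy_index, k=3)
print(results)
```

Here `doc-d` points in the opposite direction to the query and is the one left out when `k=3`.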