# Building External Knowledge from the PAQ Dataset

This module builds the external knowledge base from the PAQ dataset, which serves as the knowledge source for the RAG system. It uses LlamaIndex as the data framework and Pinecone to store the vector embeddings of the PAQ QA pairs.

**LlamaIndex**: https://docs.llamaindex.ai/en/v0.10.19/index.html

**Pinecone** as the vector database

**Embedding model**: BAAI / bge-small-en https://huggingface.co/BAAI/bge-small-en

**PAQ Dataset**: https://github.com/facebookresearch/PAQ?tab=readme-ov-file#paq-qa-pairs

### Load data

PAQ QA pairs

In [1]:
paq_file_path = '/content/drive/MyDrive/Berkeley/MIDS/DATASCI 266/project/data/PAQ/PAQ_L1.filtered.jsonl'

In [2]:
import json

with open(paq_file_path, 'r') as json_file:
    paq_data_list = list(json_file)

Show the number of records and the first ten records

In [3]:
len(paq_data_list)

14143704

In [4]:
for result in paq_data_list[:10]:
  print(result)

{"question":"how many popes have the name gregory","answer":["Sixteen"]}

{"question":"who is the book of a thousand days based on","answer":["Maid Maleen"]}

{"question":"who does lady saren marry in book of a thousand days","answer":["Lord Khasar"]}

{"question":"when was book of a thousand days written","answer":["2007"]}

{"question":"what is the order of the books in maid maleen","answer":["Book of a Thousand Days"]}

{"question":"where does book of a thousand days take place","answer":["steppes of the Eight Realms"]}

{"question":"where does the term light unto the nations come from","answer":["the prophet Isaiah"]}

{"question":"where did the term light unto the nations come from","answer":["the prophet Isaiah"]}

{"question":"who said the nations are the light of the world","answer":["Rashi"]}

{"question":"who is the main character in the witcher series","answer":["Geralt of Rivia"]}
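Each line of the file is a standalone JSON object with a `question` string and an `answer` list, as the records above show. The following is a minimal sketch of parsing such lines lazily instead of holding all of them in a list. `iter_qa_pairs` is a hypothetical helper, not part of the notebook, and an in-memory sample stands in for the real file.

```python
import io
import json
from itertools import islice


def iter_qa_pairs(lines):
    """Yield one dict per non-empty JSONL line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# In-memory stand-in for the PAQ file; both records are copied from the output above.
sample = io.StringIO(
    '{"question":"how many popes have the name gregory","answer":["Sixteen"]}\n'
    '{"question":"when was book of a thousand days written","answer":["2007"]}\n'
)

# islice stops after two records without reading the rest of the stream.
first_pairs = list(islice(iter_qa_pairs(sample), 2))
for pair in first_pairs:
    print(pair["question"], "->", pair["answer"])
```

With the real dataset, an open file handle could be passed in place of `sample`, since a file object also iterates line by line.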



### Embedding model

In [5]:
#%pip install llama-index-readers-file
%pip install llama-index-vector-stores-pinecone
%pip install llama-index-embeddings-huggingface

Collecting llama-index-vector-stores-pinecone
  Downloading llama_index_vector_stores_pinecone-0.1.4-py3-none-any.whl (6.3 kB)
Collecting llama-index-core<0.11.0,>=0.10.11.post1 (from llama-index-vector-stores-pinecone)
  Downloading llama_index_core-0.10.20.post2-py3-none-any.whl (15.4 MB)

In [6]:
import torch

# Check if a GPU is available and select the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# sentence transformers to embed QA pairs
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en",
    device=device,
    embed_batch_size=500)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### Build Pinecone Index

In [1]:
pinecone_name = "paq-qa-pairs-bge-small-en"
api_key = '********-****-****-****-***********'

In [5]:
from pinecone import Pinecone, ServerlessSpec
import os

pc = Pinecone(api_key=api_key)

In [None]:
pc.create_index(
    name=pinecone_name,
    dimension=384,  # embedding dimension of BAAI/bge-small-en
    metric="euclidean",  # distance metric used to compare vectors
    spec=ServerlessSpec(
        cloud="aws",
        region="us-west-2"
    )
)
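With a `euclidean` index, smaller scores should mean closer matches, which is consistent with the ascending scores in the search test at the end of this notebook. If the embeddings are unit-normalized (an assumption here; it depends on how the embedding wrapper is configured), ranking by Euclidean distance gives the same order as ranking by cosine similarity, because ||q - d||^2 = 2 - 2 cos(q, d). The sketch below checks this on toy 3-dimensional vectors in plain Python; all names in it are illustrative.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity; for unit vectors this is just the dot product."""
    return sum(x * y for x, y in zip(a, b))


def squared_euclidean(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy vectors standing in for 384-dimensional embeddings.
query = normalize([1.0, 2.0, 0.5])
docs = [normalize(v) for v in ([1.0, 2.1, 0.4], [0.2, -1.0, 3.0], [2.0, 1.0, 1.0])]

for d in docs:
    # Identity for unit vectors: ||q - d||^2 == 2 - 2 * cos(q, d)
    assert math.isclose(squared_euclidean(query, d), 2 - 2 * cosine(query, d))

# Smallest distance first versus largest cosine first.
by_distance = sorted(range(len(docs)), key=lambda i: squared_euclidean(query, docs[i]))
by_cosine = sorted(range(len(docs)), key=lambda i: -cosine(query, docs[i]))
print(by_distance == by_cosine)
```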

In [6]:
pinecone_index = pc.Index(pinecone_name)

### Create Vector store

In [10]:
from llama_index.vector_stores.pinecone import PineconeVectorStore

vector_store = PineconeVectorStore(pinecone_index=pinecone_index)

### Manually Construct Nodes, generate embeddings for each node and insert into vector store

In [11]:
from llama_index.core.schema import TextNode

# Offset of the first record to process in this run (earlier records are skipped)
START_INDEX = 14075500

# Process in batches
batch_size = 500  # Determine an optimal batch size based on your system and model
batches = [paq_data_list[i:i + batch_size] for i in range(START_INDEX, len(paq_data_list), batch_size)]
print(f"Number of batches: {len(batches)}")

nodes = []
for batch_idx, batch in enumerate(batches):
    # Each row is the raw JSON line, including its trailing newline
    texts = list(batch)
    embeddings = embed_model.get_text_embedding_batch(texts)

    # Attach the precomputed embedding to each node
    for text, embedding in zip(texts, embeddings):
        node = TextNode(text=text)
        node.embedding = embedding
        nodes.append(node)

    # Call vector_store.add(nodes) every 100,000 records or after the last batch
    if (batch_idx + 1) % (100000 // batch_size) == 0 or (batch_idx + 1) == len(batches):
        vector_store.add(nodes)
        # Upper bound: the last batch may hold fewer than batch_size records
        print(f"Processed {(batch_idx + 1) * batch_size} records so far...")
        nodes = []  # Reset nodes after adding to the vector store

# Safety net: the loop already flushes after the last batch, so this is normally a no-op
if nodes:
    vector_store.add(nodes)
    print(f"Processed the remaining {len(nodes)} records...")

# Print completion message (reports the full dataset size, not only the records handled in this run)
print(f"Completed. Total records processed: {len(paq_data_list)}")

Number of batches: 137


Upserted vectors:   0%|          | 0/68204 [00:00<?, ?it/s]

Processed 68500 records so far...
Completed. Total records processed: 14143704
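The progress message in the run above reports 68500 records (137 batches of 500), while the upsert bar shows 68204 vectors, because the last batch was shorter than `batch_size`. One way to report exact counts is to track the number of items actually flushed. The sketch below shows that pattern with toy stand-ins: `upload_in_chunks`, `embed_batch`, and `store_add` are hypothetical names, not LlamaIndex or Pinecone APIs.

```python
def upload_in_chunks(records, embed_batch, store_add, batch_size=500, flush_every=100_000):
    """Embed records batch by batch and flush to the store every `flush_every` records.

    `embed_batch` and `store_add` are hypothetical callables standing in for
    the embedding model and the vector store.
    """
    pending = []
    uploaded = 0
    for start in range(0, len(records), batch_size):
        texts = records[start:start + batch_size]
        pending.extend(zip(texts, embed_batch(texts)))
        if len(pending) >= flush_every:
            store_add(pending)
            uploaded += len(pending)  # count what was really sent
            pending = []
    if pending:  # final partial chunk
        store_add(pending)
        uploaded += len(pending)
    return uploaded


# Toy stand-ins: a fake embedder and a list acting as the store.
stored = []
fake_embed = lambda texts: [[float(len(t))] for t in texts]
total = upload_in_chunks(
    [f"row {i}" for i in range(1234)], fake_embed, stored.extend,
    batch_size=100, flush_every=500,
)
print(total)
```

Counting flushed items rather than multiplying batch indices keeps the reported total correct when the dataset size is not a multiple of the batch size.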


### Confirm embedding vector data in Pinecone

In [10]:
# Fetch the index information
info = pinecone_index.describe_index_stats()
print(info)

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 14143704}},
 'total_vector_count': 14143704}
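The reported vector count equals the number of records loaded earlier (14143704), which suggests every QA pair was stored. A small sketch of making that check explicit follows. The stats are copied from the output above into a plain dict; whether the real `info` object supports the same dictionary-style access is an assumption.

```python
# Stats copied from the output above; in the notebook these would come from `info`.
stats = {
    "dimension": 384,
    "index_fullness": 0.0,
    "namespaces": {"": {"vector_count": 14143704}},
    "total_vector_count": 14143704,
}
expected_records = 14143704  # len(paq_data_list) reported earlier
expected_dimension = 384     # dimension passed when the index was created

missing = expected_records - stats["total_vector_count"]
dimension_ok = stats["dimension"] == expected_dimension
print(f"missing vectors: {missing}, dimension matches: {dimension_ok}")
```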


### Test embedding vector search

In [13]:
query_str = "where is conan meriadoc from?"

In [14]:
query_embedding = embed_model.get_query_embedding(query_str)

In [15]:
# construct vector store query
from llama_index.core.vector_stores import VectorStoreQuery

query_mode = "default"
# query_mode = "sparse"
# query_mode = "hybrid"

vector_store_query = VectorStoreQuery(
    query_embedding=query_embedding, similarity_top_k=5, mode=query_mode
)

In [16]:
# returns a VectorStoreQueryResult
query_result = vector_store.query(vector_store_query)

In [18]:
from llama_index.core.schema import NodeWithScore
from typing import Optional

nodes_with_scores = []
for index, node in enumerate(query_result.nodes):
    score: Optional[float] = None
    if query_result.similarities is not None:
        score = query_result.similarities[index]
    nodes_with_scores.append(NodeWithScore(node=node, score=score))
    print(f"Content: {nodes_with_scores[index].get_content()}, Score: {nodes_with_scores[index].score}")

Content: {"question":"who is conan meriadoc mentioned alongside in armes prydein","answer":["Cadwaladr"]}
, Score: 0.365959167
Content: {"question":"who argued that conan meriadoc dates back to the mid 12th century","answer":["Hubert Guillotel"]}
, Score: 0.372904658
Content: {"question":"conan meriadoc was the roman name for whom","answer":["Magnus Maximus"]}
, Score: 0.380203128
Content: {"question":"who is the compiler of conan meriadoc","answer":["Gurheden"]}
, Score: 0.381121516
Content: {"question":"what nationality was conan meriadoc","answer":["British"]}
, Score: 0.394877195
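Each retrieved node holds the raw JSON line, including its trailing newline, which is why `, Score` starts on a new line in the output above. Before the results are passed to a RAG prompt, the content can be parsed back into question and answer fields. This is a minimal sketch under that assumption; `format_context` is a hypothetical helper and the inputs are copied from the search output.

```python
import json

# Contents copied from the search output above, including the trailing newline
# that each stored record carries.
retrieved = [
    '{"question":"what nationality was conan meriadoc","answer":["British"]}\n',
    '{"question":"conan meriadoc was the roman name for whom","answer":["Magnus Maximus"]}\n',
]


def format_context(contents):
    """Turn stored JSON strings into 'Q: ... A: ...' lines (hypothetical helper)."""
    lines = []
    for content in contents:
        pair = json.loads(content)  # json.loads tolerates surrounding whitespace
        lines.append(f"Q: {pair['question']} A: {', '.join(pair['answer'])}")
    return "\n".join(lines)


context = format_context(retrieved)
print(context)
```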
