<a href="https://colab.research.google.com/github/daawonn/RAG_LLM_study/blob/main/re_ranking_and_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install the required libraries:

In [1]:
pip install transformers faiss-cpu


Collecting faiss-cpu
  Downloading faiss_cpu-1.8.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.0 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.8.0.post1


## Import the required libraries:


In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import faiss
import numpy as np

### Load a pre-trained language model and tokenizer:

In [3]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Create a function to encode text using the language model:

In [4]:
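def encode_text(text, tokenizer, model):
    # Tokenize a single string into a batch of size 1, truncated to BERT's 512-token limit
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    # Caveat: `logits` has shape (1, num_labels) and comes from the newly initialized
    # classification head (see the warning above), so averaging over dim=1 leaves one
    # scalar per text rather than a semantic embedding
    return outputs.logits.mean(dim=1).cpu().numpy()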
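
The `encode_text` function above reduces the classifier logits to a single number per text. A common alternative for building sentence vectors is masked mean pooling over per-token hidden states. The sketch below illustrates only the pooling step on NumPy arrays; `masked_mean_pool` and the toy inputs are hypothetical names, and it assumes the hidden states would come from a base encoder (for example `AutoModel`'s `last_hidden_state`) together with the tokenizer's `attention_mask`.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors while ignoring padded positions.

    token_embeddings: (batch, seq_len, hidden) array standing in for
        a transformer's last hidden state.
    attention_mask:   (batch, seq_len) array of 1 (real token) or 0 (padding).
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Toy input: one sequence with three positions, the last of which is padding.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]], dtype=np.float32)
mask = np.array([[1, 1, 0]])
pooled = masked_mean_pool(hidden, mask)
print(pooled.shape)  # one vector of size `hidden` per input text
```

With this kind of pooling every text maps to a vector with as many components as the encoder's hidden size, which gives the similarity search far more signal than a single scalar.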


### Prepare a dataset of question-answer pairs and encode them:

In [5]:
# Replace this with your actual dataset
qas = [
    ("What is the capital of France?", "Paris"),
    # Add more question-answer pairs here
]

# Encode the question of each pair (the answers are not embedded)
encoded_qas = [encode_text(qa[0], tokenizer, model) for qa in qas]


### Build a FAISS index for efficient similarity search:

In [6]:
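# Length of one encoded vector; encode_text returns an array of shape (1,), so this is 1 here
dimension = encoded_qas[0].shape[0]
# IndexFlatL2 performs exact (exhaustive) search using squared L2 distance
index = faiss.IndexFlatL2(dimension)
# FAISS expects a float32 array of shape (n_vectors, dimension)
index.add(np.array(encoded_qas))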
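
To make the search step less opaque, here is a minimal NumPy sketch of the exhaustive comparison that a flat L2 index performs conceptually: compute the squared L2 distance from the query to every stored vector and keep the `k` smallest. `flat_l2_search` and the toy vectors are hypothetical; FAISS's own implementation is optimized and, unlike this sketch, pads the result with index -1 when fewer than `k` vectors are stored.

```python
import numpy as np

def flat_l2_search(vectors, queries, k):
    """Exhaustive search: squared L2 distance from each query to every stored vector."""
    diffs = queries[:, None, :] - vectors[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)          # shape (n_queries, n_vectors)
    order = np.argsort(sq_dists, axis=1)[:, :k]  # nearest first
    return np.take_along_axis(sq_dists, order, axis=1), order

stored = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]], dtype=np.float32)
query = np.array([[0.9, 0.0]], dtype=np.float32)
distances, indices = flat_l2_search(stored, query, k=2)
print(indices)    # ids of the two closest stored vectors
print(distances)  # their squared L2 distances, smallest first
```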


### Implement a function to retrieve the top-k most similar question-answer pairs:

In [7]:
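def retrieve_top_k(query, index, k):
    # Embed the query the same way the indexed questions were embedded
    encoded_query = encode_text(query, tokenizer, model)
    distances, indices = index.search(np.array([encoded_query]), k)
    # Caveat: if k exceeds the number of indexed vectors, FAISS pads the result with
    # index -1 and a maximal distance; qas[-1] then silently returns the last pair
    return [(qas[i], distances[0][j]) for j, i in enumerate(indices[0])]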
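
When `k` is larger than the number of indexed vectors, the search result contains -1 placeholders, and Python's negative indexing turns them into repeats of the last pair. One way to guard against that is to drop negative indices before pairing hits with their items. The helper below is a hypothetical sketch that works on plain lists, so it can be tried without FAISS; the index and distance values are made up for illustration.

```python
def pair_hits(items, indices, distances):
    """Pair each valid hit with its item, dropping -1 placeholder indices."""
    return [
        (items[i], float(d))
        for i, d in zip(indices, distances)
        if i >= 0
    ]

# Illustrative search result for k=3 against a collection holding one item.
items = [("What is the capital of France?", "Paris")]
hits = pair_hits(items, indices=[0, -1, -1], distances=[0.0005, 3.4e38, 3.4e38])
print(len(hits))  # only the real match remains
```

Inside `retrieve_top_k`, the same filter could be applied to `indices[0]` and `distances[0]`.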


### Implement a re-ranking function using another language model or a custom scoring function:

In [8]:
def re_rank(candidates, query):
    # Placeholder: a real re-ranker would score each (query, candidate) pair with a
    # second model or a custom scoring function.
    # For simplicity, the candidates are only sorted by their original distances;
    # FAISS already returns them in ascending order, so the ranking does not change
    return sorted(candidates, key=lambda x: x[1])
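
As an illustration of the custom scoring mentioned in the comment above, the following sketch re-orders candidates by word overlap between the query and each candidate question, falling back to the retrieval distance on ties. `overlap_score` and `re_rank_by_overlap` are hypothetical names, the distances are made up, and Jaccard overlap is only a toy stand-in for a learned re-ranking model.

```python
def overlap_score(query, text):
    """Jaccard overlap between lowercase word sets (a toy relevance score)."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0

def re_rank_by_overlap(candidates, query):
    """Sort ((question, answer), distance) candidates by overlap, best first.

    Ties fall back to the retrieval distance (smaller is better).
    """
    return sorted(
        candidates,
        key=lambda c: (-overlap_score(query, c[0][0]), c[1]),
    )

candidates = [
    (("Which team won the first IPL season?", "Rajasthan Royals"), 0.002),
    (("Who is the leading run-scorer in IPL history?", "Virat Kohli"), 0.006),
]
ranked = re_rank_by_overlap(candidates, "who is the leading run-scorer")
print(ranked[0][0][1])  # answer of the candidate ranked first after re-ranking
```

Unlike sorting by distance alone, this second stage can move a candidate ahead of one that the first-stage retrieval placed closer.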


### Test the retrieval and re-ranking:

In [9]:
query = "Where is the capital of France?"
top_k = 5

# Retrieve top-k candidates
candidates = retrieve_top_k(query, index, top_k)

# Re-rank the candidates
re_ranked_candidates = re_rank(candidates, query)

# Print the re-ranked results
for qa, distance in re_ranked_candidates:
    print(f"Question: {qa[0]}\nAnswer: {qa[1]}\nDistance: {distance}\n")


Question: What is the capital of France?
Answer: Paris
Distance: 0.0004824211646337062

Question: What is the capital of France?
Answer: Paris
Distance: 3.4028234663852886e+38

Question: What is the capital of France?
Answer: Paris
Distance: 3.4028234663852886e+38

Question: What is the capital of France?
Answer: Paris
Distance: 3.4028234663852886e+38

Question: What is the capital of France?
Answer: Paris
Distance: 3.4028234663852886e+38



In [10]:
# IPL dataset of question-answer pairs
ipl_qas = [
    ("Which team won the first IPL season?", "Rajasthan Royals"),
    ("Who scored the highest individual score in IPL?", "Chris Gayle (175*)"),
    ("Who is the leading run-scorer in IPL history?", "Virat Kohli"),
    ("Who is the leading wicket-taker in IPL history?", "Lasith Malinga"),
    ("Which team has won the most IPL titles?", "Mumbai Indians")
]

# Encode the question of each pair (the answers are not embedded)
encoded_qas = [encode_text(qa[0], tokenizer, model) for qa in ipl_qas]


In [11]:
# Build a FAISS index for efficient similarity search
dimension = encoded_qas[0].shape[0]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(encoded_qas))

# Implement a function to retrieve the top-k most similar question-answer pairs
def retrieve_top_k(query, index, k):
    encoded_query = encode_text(query, tokenizer, model)
    distances, indices = index.search(np.array([encoded_query]), k)
    return [(ipl_qas[i], distances[0][j]) for j, i in enumerate(indices[0])]

# Implement a re-ranking function using another language model or a custom scoring function
def re_rank(candidates, query):
    # You can implement your re-ranking logic here, e.g., using an ensemble model or custom scoring
    # For simplicity, we'll just sort the candidates by their original distances
    return sorted(candidates, key=lambda x: x[1])

# Test the retrieval and re-ranking with a sample query
query = "who is the caption of csk "
top_k = 3

# Retrieve top-k candidates
candidates = retrieve_top_k(query, index, top_k)

# Re-rank the candidates
re_ranked_candidates = re_rank(candidates, query)

# Print the re-ranked results
for qa, distance in re_ranked_candidates:
    print(f"Question: {qa[0]}\nAnswer: {qa[1]}\nDistance: {distance}\n")



Question: Who scored the highest individual score in IPL?
Answer: Chris Gayle (175*)
Distance: 0.0021723792888224125

Question: Which team won the first IPL season?
Answer: Rajasthan Royals
Distance: 0.0062577007338404655

Question: Who is the leading run-scorer in IPL history?
Answer: Virat Kohli
Distance: 0.006655266974121332



In [12]:
! pip install wikipedia~=1.4.0

Collecting wikipedia~=1.4.0
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11680 sha256=7e772583153ffd33af260d9387da2124f4a0893f25ed2630c8fe213aa7d8a3e9
  Stored in directory: /root/.cache/pip/wheels/5e/b6/c5/93f3dec388ae76edc830cb42901bb0232504dfc0df02fc50de
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


In [13]:
!pip install langchain~=0.1.16

Collecting langchain~=0.1.16
  Downloading langchain-0.1.20-py3-none-any.whl (1.0 MB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain~=0.1.16)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl (28 kB)
Collecting langchain-community<0.1,>=0.0.38 (from langchain~=0.1.16)
  Downloading langchain_community-0.0.38-py3-none-any.whl (2.0 MB)
[2K     [9

In [15]:
import nltk
from langchain_community.document_loaders import WikipediaLoader
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt")
nltk.download("stopwords")

# Load Wikipedia pages
pages = WikipediaLoader(query="Harry Potter", load_max_docs=3, lang="en").load()

# Preprocess the data: each sentence is stored as the "answer", and its
# alphanumeric non-stopword tokens form the pseudo-"question"
stop_words = set(stopwords.words("english"))
wiki_qas = []

for page in pages:
    sentences = sent_tokenize(page.page_content)
    for sentence in sentences:
        words = word_tokenize(sentence)
        filtered_words = [word for word in words if word.lower() not in stop_words and word.isalnum()]
        question = " ".join(filtered_words)
        answer = sentence
        if answer.strip():  # Check if the answer is not empty
            wiki_qas.append((question, answer))

# Encode the pseudo-question of each pair
encoded_qas = [encode_text(qa[0], tokenizer, model) for qa in wiki_qas]

# Build a FAISS index for efficient similarity search
dimension = encoded_qas[0].shape[0]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(encoded_qas))

# retrieve_top_k is redefined for wiki_qas in the next cell; re_rank is redefined below

# Sample query for testing the retrieval and re-ranking
query = "Who is Harry Potter?"
top_k = 3

def re_rank(candidates, query, num_candidates):
    # You can implement your re-ranking logic here, e.g., using an ensemble model or custom scoring
    # For simplicity, keep the first num_candidates retrieved candidates and sort them
    # by their original distances
    return sorted(candidates[:num_candidates], key=lambda x: x[1])



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
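
To see what the preprocessing loop produces for one sentence, here is a dependency-free sketch of the same idea. It is not the NLTK pipeline used above: `keyword_question` is a hypothetical helper, the regular expression is a simplified tokenizer, and `STOP_WORDS` is a tiny illustrative set (NLTK's English list is much longer).

```python
import re

# Tiny illustrative stopword set. "s" is included because the simplified
# tokenizer below splits "Harry's" into "Harry" and "s".
STOP_WORDS = {"a", "as", "him", "on", "the", "s"}

def keyword_question(sentence, stop_words=STOP_WORDS):
    """Keep alphanumeric, non-stopword tokens in their original order."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    return " ".join(t for t in tokens if t.lower() not in stop_words)

sentence = (
    "Harry's broomstick flying ability earns him a place on the "
    "Gryffindor Quidditch team as the Seeker."
)
question = keyword_question(sentence)
print(question)
```

The resulting keyword string plays the role of the "question", while the untouched sentence is kept as the "answer".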


In [16]:
def retrieve_top_k(query, index, k):
    encoded_query = encode_text(query, tokenizer, model)
    distances, indices = index.search(np.array([encoded_query]), k)
    return [(wiki_qas[i], distances[0][j]) for j, i in enumerate(indices[0])]


In [17]:
# Retrieve every indexed sentence, ordered by distance to the query
candidates = retrieve_top_k(query, index, len(encoded_qas))

# Re-rank the candidates
num_candidates = min(top_k, len(candidates))
re_ranked_candidates = re_rank(candidates, query, num_candidates)

# Print the re-ranked results
for qa, distance in re_ranked_candidates:
    print(f"Question: {qa[0]}\nAnswer: {qa[1]}\nDistance: {distance}\n")

Question: Harry broomstick flying ability earns place Gryffindor Quidditch team Seeker
Answer: Harry's broomstick flying ability earns him a place on the Gryffindor Quidditch team as the Seeker.
Distance: 8.639234465590562e-07

Question: British book awards judged children awards US
Answer: It won most of the British book awards that were judged by children and other awards in the US.
Distance: 1.601857320565614e-06

Question: cores Harry Lord Voldemort wands feathers phoenix bird making brothers
Answer: The cores of Harry and Lord Voldemort's wands have feathers from the same phoenix bird, making them "brothers".
Distance: 1.807837725209538e-05

