***Retrieval-based vs Generative Chatbot Comparison***

In [1]:
!pip install -q sentence-transformers faiss-cpu transformers datasets accelerate torch

In [2]:
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline
from datasets import load_dataset
import torch
import numpy as np
import pandas as pd

In [3]:
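def load_or_create_dataset(n_samples=200):
    """Load up to n_samples correct WikiQA pairs, or fall back to a tiny built-in set."""
    try:
        ds = load_dataset("wiki_qa", split="train")
        # Keep only rows labelled as correct answers (label == 1).
        ds = ds.filter(lambda x: x['label'] == 1)
        ds = ds.select(range(min(n_samples, len(ds))))
        print("Loaded WikiQA dataset with", len(ds), "samples.")
        return pd.DataFrame({"question": ds['question'], "answer": ds['answer']})
    except Exception as exc:  # a bare except would also catch KeyboardInterrupt
        print(f"Could not load WikiQA ({exc}). Using fallback dataset.")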
        data = {
            "question": [
                "What is machine learning?",
                "How does photosynthesis work?",
                "What causes rain?",
                "Who discovered electricity?",
                "What is quantum computing?"
            ],
            "answer": [
                "Machine learning is a field of AI that enables systems to learn from data.",
                "Photosynthesis is a process used by plants to convert light energy into chemical energy.",
                "Rain is caused by the condensation of water vapor in the atmosphere.",
                "Electricity was studied by Benjamin Franklin and others.",
                "Quantum computing uses quantum-mechanical phenomena to perform computations."
            ]
        }
        return pd.DataFrame(data)

df = load_or_create_dataset()
df.head()

Loaded WikiQA dataset with 200 samples.


,question,answer
0,how are glacier caves formed?,A glacier cave is a cave formed within the ice...
1,how much is 1 tablespoon of water,This tablespoon has a capacity of about 15 mL.
2,how much is 1 tablespoon of water,In the USA one tablespoon (measurement unit) i...
3,how much is 1 tablespoon of water,In Australia one tablespoon (measurement unit)...
4,how much are the harry potter movies worth,The series also originated much tie-in merchan...


In [4]:
retrieval_model = SentenceTransformer('all-MiniLM-L6-v2')

corpus_embeddings = retrieval_model.encode(df['answer'].tolist(), convert_to_tensor=True)
print("Embeddings created:", corpus_embeddings.shape)

Embeddings created: torch.Size([200, 384])


In [5]:
generator = pipeline("text-generation", model="gpt2", max_new_tokens=80)  # lightweight model
print("Generative model loaded.")

Device set to use cpu


Generative model loaded.


In [7]:
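def retrieval_bot(query):
    """Return the stored answer whose embedding is closest to the query."""
    query_emb = retrieval_model.encode(query, convert_to_tensor=True)
    # semantic_search returns one list of hits per query, best score first.
    hits = util.semantic_search(query_emb, corpus_embeddings, top_k=1)
    top_hit = hits[0][0]  # best hit for the single query
    return df['answer'].iloc[top_hit['corpus_id']]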
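
With top_k=1 the retriever always returns something, even when the best match is weak. Below is a minimal sketch of a score cutoff, assuming hits in the shape `util.semantic_search` returns (dicts with `corpus_id` and `score`). The helper `pick_answer`, the 0.5 threshold, and the sample data are hypothetical; the threshold would need tuning on real queries.

```python
from typing import Optional, Sequence


def pick_answer(hits: Sequence[dict], answers: Sequence[str],
                min_score: float = 0.5) -> Optional[str]:
    """Return the best-scoring answer, or None if nothing clears min_score."""
    if not hits:
        return None
    best = max(hits, key=lambda h: h["score"])
    if best["score"] < min_score:
        return None  # caller can fall back to "I don't know" or to the generator
    return answers[best["corpus_id"]]


# Hypothetical data in the shape produced for a single query.
answers = ["Rain is caused by condensation.", "A computer is a device."]
confident = [{"corpus_id": 0, "score": 0.82}]
weak = [{"corpus_id": 1, "score": 0.21}]

print(pick_answer(confident, answers))
print(pick_answer(weak, answers))
```

Returning None rather than a low-scoring passage lets the caller decide how to handle queries the corpus does not cover.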

In [8]:
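def generative_bot(query):
    """Generate a GPT-2 continuation of the query; the result includes the prompt."""
    # The pipeline returns a list of dicts; "generated_text" holds prompt + continuation.
    out = generator(query + " Answer:", max_new_tokens=60)[0]["generated_text"]
    return out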
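
Because `generated_text` holds the prompt followed by the continuation, each generative response echoes the query. Below is a minimal sketch of trimming it, with `strip_prompt` as a hypothetical helper and made-up sample strings. The text-generation pipeline also accepts `return_full_text=False`, which should have a similar effect.

```python
def strip_prompt(prompt: str, generated_text: str) -> str:
    """Drop the echoed prompt from the front of a generated string, if present."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Prompt not found at the start: return the text unchanged apart from whitespace.
    return generated_text.strip()


# Illustrative strings only; not actual model output.
prompt = "What causes rain? Answer:"
raw = "What causes rain? Answer: Water vapor condenses."
completion = strip_prompt(prompt, raw)
print(completion)
```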

In [9]:
# These queries mirror the topics of the fallback dataset, so WikiQA may hold no matching answer.
test_queries = [
    "What is machine learning?",
    "How do plants make food?",
    "Explain rain formation.",
    "Tell me about electricity discovery.",
    "What is quantum computing?"
]

results = []

for q in test_queries:
    ret = retrieval_bot(q)
    gen = generative_bot(q)
    results.append([q, ret, gen])

df_results = pd.DataFrame(results, columns=["Query", "Retrieval Response", "Generative Response"])
df_results

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


,Query,Retrieval Response,Generative Response
0,What is machine learning?,A computer is a general purpose device that ca...,What is machine learning? Answer: Machine lear...
1,How do plants make food?,"It eats leaves, herbs, twigs and green plants ...","How do plants make food? Answer: ""Well, you ca..."
2,Explain rain formation.,"According to folklore, if it is cloudy when a ...",Explain rain formation. Answer: No.\n\nAnswer:...
3,Tell me about electricity discovery.,Others use a local power source such as a batt...,Tell me about electricity discovery. Answer: T...
4,What is quantum computing?,A computer is a general purpose device that ca...,What is quantum computing? Answer: Quantum com...
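
The table supports a side-by-side reading, and a rough number can make the comparison easier to track. The sketch below computes a whitespace-token F1 between a response and a reference answer. `token_f1` is a hypothetical helper that is not part of the notebook above, and the strings are made up for illustration.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (lowercased, whitespace split)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Illustrative strings only.
score = token_f1("rain is water", "rain is caused by water")
print(round(score, 2))
```

Token overlap rewards shared wording rather than correctness, so it works best as a coarse signal alongside manual inspection.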
