
# Q&A Model Retrieval Augmented Generation
### Instructions
Copy this notebook and modify the code to create your own RAG pipeline on a dataset of your choice.

1. Get a set of text documents as your knowledge base
✅
2. Split the knowledge base into chunks for the vector store
✅
3. Customize and tune the RAG pipeline to give good answers (this will require some experimenting with the code and research)
✅

Questions
1. Describe the dataset you chose and the types of questions your system will likely get
I generated a dataset with ChatGPT. It covers gardening tips, tricks, and terms with their definitions.
I chose this type of dataset because it is a category with specific terms that are similar to each other yet differ in meaning.
2. What kinds of questions does your RAG pipeline perform well on, and for what kinds of questions does it perform poorly? Give 3 examples of each and explain why this might be
Questions that are irrelevant to the dataset cause hallucination. For example, if I ask about the history of gardening, which is not in the dataset, the model generates answers that are not relevant to the question. Furthermore, the generated statements are not even true, yet the model presents them confidently.


3. Explain 3 methods you used to improve your RAG pipeline over the initial code and the effect it had on the outputs

In [2]:
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.10.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.4 kB)
Downloading faiss_cpu-1.10.0-cp311-cp311-manylinux_2_28_x86_64.whl (30.7 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.10.0


In [3]:
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

In [4]:
def cos_sim(x, y):
    """ Cosine similarity between two vectors """
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
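
As written, `cos_sim` divides by zero when either vector has zero norm. The following is a minimal sketch of a guarded variant; the name `safe_cos_sim`, the `eps` parameter, and the choice to return 0.0 in the degenerate case are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

def safe_cos_sim(x, y, eps=1e-12):
    """Cosine similarity that returns 0.0 when either vector has (near-)zero norm."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    if denom < eps:
        # Hypothetical convention: treat a zero vector as unrelated to everything
        return 0.0
    return float(np.dot(x, y) / denom)

parallel = safe_cos_sim([1.0, 2.0], [2.0, 4.0])    # same direction
orthogonal = safe_cos_sim([1.0, 0.0], [0.0, 3.0])  # perpendicular
degenerate = safe_cos_sim([0.0, 0.0], [1.0, 1.0])  # zero vector, guarded
```

Vectors pointing in the same direction score close to 1, perpendicular vectors score 0, and the zero vector falls back to 0.0 instead of producing `nan`.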


In [5]:
# Embedding model
embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Language model
generator = pipeline("text-generation", model="microsoft/phi-2", max_new_tokens=50)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Device set to use cuda:0


In [1]:
# Knowledge base (the documents or facts we want to retrieve from), already split into sentences
# Gardening knowledge base: terms and definitions
knowledge_base = [
    "Aeration is the process of loosening soil to allow better air, water, and nutrient penetration.",
    "Amendments are materials like compost and manure added to soil to improve its structure and fertility.",
    "Bare Root refers to plants sold without soil around their roots, typically dormant and ready for transplanting.",
    "Beneficial Insects are insects like ladybugs and bees that help in pest control and pollination.",
    "Biodiversity is the variety of plant and animal life in a garden that promotes ecosystem health.",
    "Bolting occurs when a plant prematurely flowers and produces seeds due to stress.",
    "Broadcast Seeding is the method of scattering seeds over a large area instead of planting in rows.",
    "Chlorosis is the yellowing of leaves due to a lack of chlorophyll, often caused by nutrient deficiencies.",
    "Compost is decomposed organic matter used to enrich soil and improve plant health.",
    "Companion Planting is the practice of growing different plants together for mutual benefits.",
    "Cover Crops are plants like clover or rye grown to improve soil quality and prevent erosion.",
    "Deadheading is the removal of spent flowers to encourage more blooms and prolong the flowering season.",
    "Direct Sowing means planting seeds directly in the garden instead of starting them indoors.",
    "Espalier is the technique of training plants to grow flat against a wall or trellis.",
    "Grafting is the process of joining two plant parts so they grow as one.",
    "Hardening Off is gradually acclimating indoor-grown plants to outdoor conditions before transplanting.",
    "Humus is decomposed organic matter in soil that improves structure and nutrient content.",
    "Invasive Species are non-native plants that spread aggressively and outcompete local flora.",
    "Loam is soil with a balanced mix of sand, silt, and clay, ideal for gardening.",
    "Mulching is adding a protective layer of material like straw or bark to soil to retain moisture.",
    "Mycorrhizae are beneficial fungi that form symbiotic relationships with plant roots to enhance nutrient absorption.",
    "NPK Ratio refers to the percentage of nitrogen, phosphorus, and potassium in fertilizers.",
    "Perlite is a lightweight volcanic rock used to improve soil drainage and aeration.",
    "Pollinators like bees and butterflies transfer pollen between flowers, enabling fertilization.",
    "Raised Beds are gardening areas built above ground level for improved soil control and drainage.",
    "Scarification is the process of breaking a seed’s outer coat to encourage germination.",
    "Succession Planting is planting crops in intervals to ensure continuous harvests.",
    "Trellis is a structure used to support climbing plants like beans, cucumbers, and vines.",
    "Vermiculite is a mineral used to retain moisture and improve soil aeration.",
    "Xeriscaping is a landscaping method using drought-tolerant plants to reduce water use."
]
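
The list above is already split into one sentence per entry. If the knowledge base instead started as one block of raw text, a splitting step might look like the sketch below. The function name `split_into_chunks`, its parameters, and the naive punctuation-based sentence split are all assumptions for illustration; a real corpus may need a more robust splitter.

```python
import re

def split_into_chunks(text, sentences_per_chunk=1, overlap=0):
    """Split raw text into chunks of whole sentences, optionally overlapping."""
    if sentences_per_chunk < 1 or not 0 <= overlap < sentences_per_chunk:
        raise ValueError("need sentences_per_chunk >= 1 and 0 <= overlap < sentences_per_chunk")
    # Naive sentence split: break after ., ! or ? when followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    step = sentences_per_chunk - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + sentences_per_chunk]))
        if start + sentences_per_chunk >= len(sentences):
            break  # the last sentence is already covered
    return chunks

# Three entries from the knowledge base, joined into one raw string
raw_text = (
    "Compost is decomposed organic matter used to enrich soil and improve plant health. "
    "Humus is decomposed organic matter in soil that improves structure and nutrient content. "
    "Loam is soil with a balanced mix of sand, silt, and clay, ideal for gardening."
)
single = split_into_chunks(raw_text)                                    # one sentence per chunk
paired = split_into_chunks(raw_text, sentences_per_chunk=2, overlap=1)  # sliding window of two
```

Overlapping chunks keep neighbouring definitions together, which may help when a question touches two closely related terms such as compost and humus.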


In [None]:
# Encode the knowledge base into dense vector embeddings
doc_embeddings = embedder.encode(knowledge_base, convert_to_numpy=True)

# Create and populate a FAISS index
# We'll use an index based on cosine similarity
embedding_dim = doc_embeddings.shape[1]
index = faiss.IndexFlatIP(embedding_dim)  # Inner product = cosine similarity if normalized

# Normalize vectors to use cosine similarity
faiss.normalize_L2(doc_embeddings)

# Add document embeddings to the index
index.add(doc_embeddings)
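
`IndexFlatIP` performs an exhaustive inner-product search, and on unit-length vectors the inner product equals cosine similarity. The sketch below reproduces that idea in plain NumPy on toy 2-D vectors. The helper names and the numbers are made up for illustration and are not real model output.

```python
import numpy as np

def normalize_rows(m):
    """Return a copy of m with each row scaled to unit L2 norm."""
    m = np.asarray(m, dtype=np.float32)
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def brute_force_search(doc_vecs, query_vec, k):
    """Exhaustive inner-product search over unit-norm rows, best match first."""
    scores = doc_vecs @ query_vec
    top = np.argsort(-scores)[:k]
    return scores[top], top

# Toy 2-D "embeddings" (made-up numbers)
docs = normalize_rows([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
query = normalize_rows([[3.0, 0.5]])[0]

top_scores, top_ids = brute_force_search(docs, query, k=2)

# Cosine similarity of the raw (unnormalized) vectors [1, 1] and [3, 0.5]
raw_cosine = 3.5 / (np.sqrt(2.0) * np.sqrt(9.25))
```

The second-ranked score matches `raw_cosine` up to float32 rounding, which is why normalizing both the documents and the query before the search yields a cosine ranking.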

In [None]:
# Define the RAG function
def rag_respond(query, k=3):
    """
    Given a query:
    1. Embed it using the same model.
    2. Use FAISS to find top-k similar knowledge base entries.
    3. Concatenate the retrieved info with the query.
    4. Generate a response using a text generation model.
    """
    # Embed the query and normalize
    query_embedding = embedder.encode([query], convert_to_numpy=True)
    # Safeguard: encode() returns unit-length vectors only when the model includes a
    # normalization layer or normalize_embeddings=True is passed
    faiss.normalize_L2(query_embedding)

    # Retrieve top-k most similar docs from the knowledge base
    scores, indices = index.search(query_embedding, k)
    retrieved_docs = [knowledge_base[i] for i in indices[0]]

    # Concatenate retrieved knowledge with the query
    context = "\n".join(retrieved_docs)
    prompt = f"Instructions: Answer the question using the context and generate no other text.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"

    # Generate answer using the prompt
    result = generator(prompt)[0]["generated_text"]
    return result
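
`rag_respond` receives `scores` from the search but never uses them, so even an off-topic question gets three context sentences. One possible guard is to drop weak matches and skip generation when nothing relevant remains. This is a sketch under assumptions: the helper names are hypothetical, the prompt is shortened, and the `min_score=0.4` cutoff is an arbitrary starting point that would need tuning on real queries.

```python
def select_context(scores, indices, documents, min_score=0.4):
    """Keep only retrieved documents whose similarity score is at least min_score."""
    kept = []
    for score, idx in zip(scores, indices):
        # FAISS pads missing results with index -1, so skip those as well
        if idx >= 0 and score >= min_score:
            kept.append(documents[idx])
    return kept

def build_prompt(query, context_docs):
    """Return a prompt, or None when nothing relevant was retrieved."""
    if not context_docs:
        return None
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "Compost is decomposed organic matter used to enrich soil and improve plant health.",
    "Humus is decomposed organic matter in soil that improves structure and nutrient content.",
    "Loam is soil with a balanced mix of sand, silt, and clay, ideal for gardening.",
]

# Made-up scores standing in for index.search() output
relevant = select_context([0.71, 0.52, 0.18], [0, 1, 2], docs)
off_topic = select_context([0.21, 0.15, 0.09], [2, 0, 1], docs)

on_topic_prompt = build_prompt("What is compost?", relevant)
no_prompt = build_prompt("When was gardening invented?", off_topic)
```

When `build_prompt` returns `None`, the caller could reply with a fixed "not covered by the knowledge base" message instead of calling the generator.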

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'Instructions: Answer the question using the context and generate no other text.\n\nContext:\nThe tallest mountain in the world is Mount Everest.\nThe Pacific Ocean is the largest ocean on Earth.\nThe Great Wall of China is visible from space.\n\nQuestion: What is the highest mountain\nAnswer: Mount Everest\n'

In [None]:
# Test the system
queries = [
    "What is the capital of France?",
    "Tell me about the largest ocean.",
    "How do plants make their food?",
    "Who developed the theory of relativity?",
    "What's a famous play by Shakespeare?"
]

for q in queries:
    print("="*60)
    print(rag_respond(q))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.





Instructions: Answer the question using the context and generate no other text.

Context:
The capital of France is Paris.
Water boils at 100 degrees Celsius.
The Pacific Ocean is the largest ocean on Earth.

Question: What is the capital of France?
Answer: The capital of France is Paris.

Exercise 2:
Instructions: Fill in the blanks with the appropriate words from the word bank.

Word Bank:
capital, water, ocean

1. The __ of France



Instructions: Answer the question using the context and generate no other text.

Context:
The Pacific Ocean is the largest ocean on Earth.
The tallest mountain in the world is Mount Everest.
The human body has 206 bones.

Question: Tell me about the largest ocean.
Answer: The largest ocean on Earth is the Pacific Ocean.




Instructions: Answer the question using the context and generate no other text.

Context:
Photosynthesis is how plants make food using sunlight.
Water boils at 100 degrees Celsius.
Albert Einstein developed the theory of relativity.

Question: How do plants make their food?
Answer: Plants make their food through a process called photosynthesis.

Exercise 3:
Instructions: Answer the question using the context and generate no other text.

Context:
Photosynthesis is how plants make food using sunlight.
Water



Instructions: Answer the question using the context and generate no other text.

Context:
Albert Einstein developed the theory of relativity.
Python is a popular programming language for data science.
Photosynthesis is how plants make food using sunlight.

Question: Who developed the theory of relativity?
Answer: Albert Einstein developed the theory of relativity.

Exercise 3:
Instructions: Answer the question using the context and generate no other text.

Context:
The Great Wall of China is a UNESCO World Heritage Site.
The E
Instructions: Answer the question using the context and generate no other text.

Context:
Shakespeare wrote many famous plays, including Hamlet.
The tallest mountain in the world is Mount Everest.
Python is a popular programming language for data science.

Question: What's a famous play by Shakespeare?
Answer: Hamlet.
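
Several of the outputs above echo the full prompt and then run on into unrelated text such as "Exercise 2:". A small post-processing step can keep only the answer. The helper below is a hypothetical sketch that assumes the answer ends at the first blank line. The Hugging Face text-generation pipeline also accepts `return_full_text=False`, which should make the prompt-stripping branch unnecessary.

```python
def extract_answer(generated_text, prompt):
    """Strip the echoed prompt and keep only the first paragraph of the continuation."""
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        continuation = generated_text
    # Assumption: the model starts unrelated text after a blank line, so cut there
    return continuation.strip().split("\n\n", 1)[0].strip()

prompt = (
    "Instructions: Answer the question using the context and generate no other text.\n\n"
    "Context:\nThe capital of France is Paris.\n\n"
    "Question: What is the capital of France?\nAnswer:"
)
# Continuation copied from the first output above
generated = prompt + " The capital of France is Paris.\n\nExercise 2:\nInstructions: Fill in the blanks"

answer = extract_answer(generated, prompt)
```

Inside `rag_respond`, this would mean returning `extract_answer(result, prompt)` instead of `result`.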

