<a href="https://colab.research.google.com/github/VamanPrabhakar-03/Gen-Ai-LLM-Model/blob/main/RAG_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **RAG Implementation Using Mistral 7B or Phi-2 in Google Colab**

This guide implements a Retrieval-Augmented Generation (RAG) system in Google Colab. It covers the following steps:

1. **Data Preparation:** Extract documents from Wikipedia using wikipedia-api.

2. **Document Chunking:** Split documents into smaller, manageable chunks.

3. **Retriever:** Perform semantic search using sentence-transformers embeddings stored in ChromaDB.

4. **Re-Ranker:** Use Cross-Encoders for better retrieval accuracy.

5. **Generator:** Use Mistral 7B or Phi-2 for answer generation.

**Prerequisites**

Ensure your Colab runtime is set to GPU:

* **Go to Runtime → Change runtime type → GPU.**
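
A quick way to confirm the runtime change took effect is to look for the NVIDIA driver tool from Python. The following is a minimal sketch that uses only the standard library. The helper name `gpu_visible` is hypothetical, and finding a working `nvidia-smi` is only a proxy for a usable GPU; once PyTorch is installed, `torch.cuda.is_available()` is the more direct check.

```python
import shutil
import subprocess

def gpu_visible():
    """Return True if nvidia-smi is on PATH and exits successfully (hypothetical helper)."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        # "nvidia-smi -L" lists the GPUs the driver can see
        result = subprocess.run([exe, "-L"], capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0

print("GPU visible:", gpu_visible())
```

If this prints `False` on Colab, the runtime type is most likely still set to CPU.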











# **Step 1: Install Required Libraries**


In [None]:
!pip install wikipedia-api
!pip install chromadb
!pip install faiss-cpu
!pip install sentence-transformers
!pip install transformers torch
!pip install bitsandbytes accelerate
!pip install huggingface_hub



# **Step 2: Extract Wikipedia Pages**

In [None]:
import wikipediaapi
import logging

logging.basicConfig(level=logging.INFO)

def get_wikipedia_page(title):
    try:
        # Provide a proper User-Agent with contact info
        user_agent = "MyWikipediaClient/1.0 (Contact: your-email@example.com)"
        wiki_wiki = wikipediaapi.Wikipedia(user_agent=user_agent, language='en')
        page = wiki_wiki.page(title)
        if not page.exists():
            logging.warning(f"Page '{title}' not found.")
            return None
        return page.text
    except Exception as e:
        logging.error(f"Error fetching page '{title}': {e}")
        return None

# Example: Fetch Wikipedia pages
titles = ["List of Tamil films of 2025"]
#titles = ["Stock market", "Day trading","Artificial intelligence", "Machine learning"]

wiki_pages = {title: get_wikipedia_page(title) for title in titles}

# Print the first 500 characters for verification
for title, content in wiki_pages.items():
    if content:
        print(f"\n--- {title} ---\n{content[:500]}...\n")



--- List of Tamil films of 2025 ---
This is a list of Tamil language films produced in the Tamil cinema in India that are to be released/scheduled in 2025.

Box office collection
The following is the list of highest-grossing Tamil cinema films released in 2025. The rank of the films in the following table depends on the estimate of worldwide collections as reported by organizations classified as green by Wikipedia. There is no official tracking of domestic box office figures within India.

January–March
April–June
Upcoming release...



# **Step 3: Perform Document Chunking**
Use NLTK to split large texts into smaller chunks.

In [None]:
!pip uninstall -y nltk
!pip install nltk


Found existing installation: nltk 3.9.1
Uninstalling nltk-3.9.1:
  Successfully uninstalled nltk-3.9.1
Collecting nltk
  Using cached nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Using cached nltk-3.9.1-py3-none-any.whl (1.5 MB)
Installing collected packages: nltk
Successfully installed nltk-3.9.1


In [None]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')  # Attempting to download 'punkt_tab'


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [None]:
nltk.data.path.append('/root/nltk_data')

In [None]:
import logging
logging.basicConfig(level=logging.INFO)

def chunk_text(text, max_tokens=512):
    sentences = sent_tokenize(text)
    current_chunk = []
    current_length = 0
    chunks = []

    for sentence in sentences:
        words = word_tokenize(sentence)
        sentence_length = len(words)

        # Handle sentences longer than max_tokens
        if sentence_length > max_tokens:
            logging.warning(f"A single sentence exceeds {max_tokens} tokens. Splitting sentence.")
            # Flush the pending chunk first so chunks stay in document order
            if current_chunk:
                chunks.append(" ".join(current_chunk))
                current_chunk = []
                current_length = 0
            for i in range(0, len(words), max_tokens):
                chunks.append(" ".join(words[i:i + max_tokens]))
            continue

        # Start a new chunk when adding this sentence would exceed the limit
        if current_chunk and current_length + sentence_length > max_tokens:
            chunks.append(" ".join(current_chunk))
            current_chunk = []
            current_length = 0

        current_chunk.append(sentence)
        current_length += sentence_length

    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

# Chunk all Wikipedia data
chunked_data = {}
for title, content in wiki_pages.items():
    if content:
        chunks = chunk_text(content)
        chunked_data[title] = chunks
        logging.info(f"{title}: {len(chunks)} chunks created.")


In [None]:
chunked_data

{'List of Tamil films of 2025': ['This is a list of Tamil language films produced in the Tamil cinema in India that are to be released/scheduled in 2025. Box office collection\nThe following is the list of highest-grossing Tamil cinema films released in 2025. The rank of the films in the following table depends on the estimate of worldwide collections as reported by organizations classified as green by Wikipedia. There is no official tracking of domestic box office figures within India. January–March\nApril–June\nUpcoming releases\nSee also\nLists of Tamil-language films\nList of Tamil films of 2024\nList of highest-grossing Tamil films\n\nNotes\n\n\n== References ==']}
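
The chunker above produces non-overlapping chunks, so a fact that straddles a chunk boundary can end up split across two chunks. A common variation, which this notebook does not use, is a sliding window with overlap. The sketch below assumes plain whitespace-separated words instead of NLTK tokens so that it runs on its own; `chunk_with_overlap` and its `overlap` parameter are hypothetical names.

```python
def chunk_with_overlap(words, max_tokens=512, overlap=50):
    """Split a list of words into chunks that share `overlap` words with their neighbor."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        # Stop once the current window has reached the end of the text
        if start + max_tokens >= len(words):
            break
    return chunks

words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_with_overlap(words, max_tokens=4, overlap=1):
    print(chunk)
```

Each chunk here repeats the last word of the previous one. Larger overlaps trade extra storage and embedding time for a better chance of keeping related sentences together.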

# **Step 4: Store Data in ChromaDB**
Generate embeddings using SentenceTransformer and store chunks in ChromaDB:

In [None]:
import chromadb
from sentence_transformers import SentenceTransformer

# Initialize ChromaDB (in-memory client)
client = chromadb.Client()
# get_or_create_collection avoids the "UniqueConstraintError: Collection wiki_rag
# already exists" that create_collection raises when this cell is re-run
collection = client.get_or_create_collection("wiki_rag")

# Load SentenceTransformer for embeddings
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Store chunks in ChromaDB; upsert keeps re-runs from adding duplicate IDs
for title, chunks in chunked_data.items():
    for i, chunk in enumerate(chunks):
        embedding = embedding_model.encode(chunk)
        doc_id = f"{title}_{i}"
        collection.upsert(ids=[doc_id], embeddings=[embedding.tolist()], documents=[chunk])

print("Data stored in ChromaDB.")


# **Step 5: Perform Semantic Search**

In [None]:
def retrieve_context(query, n_results=5):
    query_embedding = embedding_model.encode([query])
    results = collection.query(
        query_embeddings=query_embedding.tolist(),
        n_results=n_results
    )
    return results['documents'][0]

# Note: this query only finds relevant text if the AI-related titles in Step 2 were indexed
query = "How is AI used?"
retrieved_docs = retrieve_context(query)
print("Retrieved Documents:\n", "\n".join(retrieved_docs))
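
Conceptually, the query above embeds the question and returns the stored chunks whose vectors lie closest to it. The sketch below illustrates that idea with brute-force cosine similarity over toy 3-dimensional vectors. The vectors, IDs, and the `top_k_by_cosine` helper are made up for illustration, and ChromaDB's actual index and default distance metric may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_by_cosine(query_vec, store, k=2):
    """Return the IDs of the k stored vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Toy "embeddings"; real ones from all-MiniLM-L6-v2 have far more dimensions
toy_store = {
    "chunk_0": [1.0, 0.0, 0.0],
    "chunk_1": [0.7, 0.7, 0.0],
    "chunk_2": [0.0, 0.0, 1.0],
}
print(top_k_by_cosine([0.9, 0.1, 0.0], toy_store, k=2))
```

A vector database replaces the brute-force loop with an index so the lookup stays fast as the number of chunks grows.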


# **Step 6: Re-Rank Results Using Cross-Encoder**

In [None]:
from sentence_transformers import CrossEncoder

re_ranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def re_rank(query, documents):
    pairs = [[query, doc] for doc in documents]
    scores = re_ranker.predict(pairs)
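    # Sort on the score only, so tied scores never fall back to comparing document text
    sorted_docs = [doc for _, doc in sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)]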
    return sorted_docs

re_ranked_docs = re_rank(query, retrieved_docs)
print("Top Re-ranked Document:", re_ranked_docs[0][:500])
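
The re-ranking step boils down to two operations: score every (query, document) pair, then sort the documents by score in descending order. The sketch below shows that pattern with a toy word-overlap scorer standing in for the Cross-Encoder so that it runs without downloading a model. `overlap_score` and `re_rank_with` are hypothetical names, and a real cross-encoder scores pairs with a neural model rather than word overlap.

```python
def overlap_score(query, document):
    """Toy relevance score: fraction of query words that appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words)

def re_rank_with(score_fn, query, documents, top_n=None):
    """Order documents by score_fn(query, doc), highest first."""
    scored = [(score_fn(query, doc), doc) for doc in documents]
    # Python's sort is stable, so documents with equal scores keep their retrieval order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    ranked = [doc for _, doc in scored]
    return ranked if top_n is None else ranked[:top_n]

docs = [
    "Cats sleep most of the day.",
    "AI is used in trading for predictive modeling.",
    "Trading hours vary by exchange.",
]
print(re_rank_with(overlap_score, "how is ai used in trading", docs, top_n=2))
```

Passing the scorer in as a function makes it easy to swap the toy scorer for a model-based one without touching the sorting logic.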


In [None]:
from huggingface_hub import notebook_login
notebook_login()


In [None]:
!huggingface-cli whoami  # Confirm which Hugging Face account is logged in

# **Step 7: Load Mistral 7B with Quantization (Memory Efficient)**

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Configure 4-bit Quantization for Efficient Loading
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,  # Compute in fp16; the float32 default is slower
)

# Load the Model and Tokenizer
model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cuda", quantization_config=quant_config)

print("Mistral 7B Loaded Successfully with 4-bit Quantization.")
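
To see why 4-bit quantization matters on a single Colab GPU, it helps to estimate the memory taken by the weights alone. The sketch below is back-of-the-envelope arithmetic, assuming roughly 7 billion parameters. It ignores quantization constants, activations, and the KV cache, so real usage will be higher. `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

params = 7e9  # Assumption: roughly 7 billion parameters
for label, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gib(params, bits):.1f} GiB")
```

Going from 16-bit to 4-bit weights cuts the estimate by a factor of four, which is the main reason the quantized model is easier to fit on a single GPU.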


# **Step 8: Generate Responses with RAG**

In [None]:
import torch

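def generate_response(query, context):
    prompt = f"Based on the following context:\n{context}\nAnswer the question: {query}"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs, max_new_tokens=200, do_sample=True, top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,  # Mistral's tokenizer defines no pad token
        )

    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)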

# Example
query = "Explain how AI is used in trading."
context = "AI in trading is used for algorithmic trading, market sentiment analysis, and predictive modeling using historical data."
response = generate_response(query, context)

print("Response:", response)
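
The example above passes a hand-written context string, so retrieval and re-ranking are not yet part of the answer. One way to connect the stages is a small function that takes each stage as a callable. The sketch below uses stub stages so it runs on its own; `rag_answer`, the `fake_*` functions, and the `top_k` parameter are hypothetical names, not part of the notebook.

```python
def rag_answer(query, retrieve, rerank, generate, top_k=3):
    """Run retrieve -> rerank -> generate and return the generated answer."""
    candidates = retrieve(query)
    if not candidates:
        return "No relevant context found."
    ranked = rerank(query, candidates)
    # Keep only the best-ranked chunks so the prompt stays short
    context = "\n\n".join(ranked[:top_k])
    return generate(query, context)

# Stub stages standing in for the vector store, the cross-encoder, and the LLM
def fake_retrieve(query):
    return ["doc about trading", "doc about AI in trading", "doc about cooking"]

def fake_rerank(query, docs):
    return sorted(docs, key=lambda d: "AI" in d, reverse=True)

def fake_generate(query, context):
    return f"Q: {query} | context: {context}"

print(rag_answer("How is AI used in trading?", fake_retrieve, fake_rerank, fake_generate, top_k=2))
```

In the notebook, a call such as `rag_answer(query, retrieve_context, re_rank, generate_response)` would presumably connect the real stages, assuming the cells defining those functions have already run.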
