# LLM Orchestration: RAG, DSPy, and Vector Databases with GPT-2

## Introduction

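This Colab notebook is a hands-on exploration of LLM orchestration techniques: Retrieval-Augmented Generation (RAG), DSPy, vector databases, and chunking strategies. GPT-2 is the lightweight base model named in the title, and its loading code is kept (commented out) in the Setup section. As written, the Setup cell instead loads `meta-llama/Meta-Llama-3-8B-Instruct` with 4-bit quantization, which requires a GPU runtime and access to the gated checkpoint on Hugging Face.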

## Setup

First, let's install the necessary libraries:

In [None]:
!pip install -q transformers accelerate bitsandbytes datasets torch langchain langchain_community langchain-huggingface faiss-cpu sentence-transformers dspy nltk

Now, let's import the required libraries:

In [None]:
import torch
import bitsandbytes  # not called directly; importing it confirms the quantization backend is installed
import faiss
import numpy as np
import dspy
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    GPT2LMHeadModel,
    GPT2Tokenizer,
    TextStreamer,
    pipeline,
)
from sentence_transformers import SentenceTransformer
from langchain_core.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain_huggingface import HuggingFaceEmbeddings, HuggingFacePipeline
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter

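Next, let's set up the language model and wrap it for LangChain:

In [None]:
from transformers import BitsAndBytesConfig

# Lightweight alternative: uncomment these three lines (and comment out the
# Llama block below) to run the notebook with GPT-2.
# model_name = "gpt2"
# tokenizer = GPT2Tokenizer.from_pretrained(model_name)
# model = GPT2LMHeadModel.from_pretrained(model_name)

# Default: Llama 3 8B Instruct, quantized to 4-bit to reduce GPU memory use.
# The checkpoint is gated, so you must accept its license and log in to Hugging Face first.
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)

# Create a text-generation pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    do_sample=True,  # temperature and top_p only take effect when sampling is enabled
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15,
)

# Create a LangChain wrapper for the Hugging Face pipeline
llm = HuggingFacePipeline(pipeline=pipe)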

## 1. Retrieval-Augmented Generation (RAG)

### 1.1 Setting up a Document Store
#### Preparing Data for Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) combines retrieval with generation: before the model answers, the system retrieves passages related to the query from an external knowledge source and adds them to the prompt. Grounding the answer in retrieved text can improve accuracy and relevance, particularly for facts the model did not see during training.

The following code snippet demonstrates the essential steps in preparing data for a RAG system. We'll walk through the process of:

1. Loading a dataset from a reliable source
2. Converting the dataset into a suitable format for processing
3. Splitting the text into manageable chunks
4. Creating embeddings and storing them in a vector database

These steps turn raw text into an index that can be searched quickly at query time, when the model needs relevant passages.

The code uses the Simple English Wikipedia dataset, which suits this purpose because it covers a broad range of topics in straightforward language. We'll use tools from the Hugging Face ecosystem, LangChain, and the FAISS library to process and store the data.
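
Before bringing in the real libraries, the four steps can be sketched without any dependencies. Everything in this sketch is a stand-in chosen for illustration: the two-document corpus, the `chunk_text`, `embed`, `cosine`, and `search` helpers, and the bag-of-words "embedding" are hypothetical and far cruder than the sentence-transformer embeddings and FAISS index used in this notebook.

```python
import math
import re
from collections import Counter

# 1. "Load" a dataset (a tiny hard-coded corpus stands in for Wikipedia)
corpus = [
    "Paris is the capital of France. It lies on the river Seine.",
    "Python is a programming language. It is popular for data science.",
]

# 2-3. Treat each document as a plain string and split it into chunks
#      (one sentence per chunk in this toy version)
def chunk_text(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

chunks = [chunk for doc in corpus for chunk in chunk_text(doc)]

# 4. "Embed" each chunk as a bag-of-words count vector and keep the vectors in a list
def embed(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

store = [(chunk, embed(chunk)) for chunk in chunks]

def search(query, k=1):
    """Return the k chunks whose vectors are most similar to the query vector."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

print(search("What is the capital of France?"))
```

The real pipeline has the same shape; only the chunker, the embedding model, and the store are replaced by stronger components.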

In [None]:
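# Load the first 1,000 articles of Simple English Wikipedia.
# Recent `datasets` releases no longer run the script-based "wikipedia" loader,
# so this uses the Parquet snapshot published under `wikimedia/wikipedia`.
dataset = load_dataset("wikimedia/wikipedia", "20231101.simple", split="train[:1000]")

# Extract the raw article text
documents = [doc["text"] for doc in dataset]

# Split on paragraph breaks ("\n\n", the CharacterTextSplitter default) and merge
# paragraphs into chunks of roughly 5,000 characters with up to 20 characters of overlap
text_splitter = CharacterTextSplitter(chunk_size=5000, chunk_overlap=20)
texts = text_splitter.create_documents(documents)

# Embed every chunk with the default sentence-transformers model
# and store the vectors in an in-memory FAISS index
embeddings = HuggingFaceEmbeddings()
db = FAISS.from_documents(texts, embeddings)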


### 1.2 Implementing RAG

Now, let's implement a simple RAG system using LangChain:

In [None]:
from langchain.chains import RetrievalQA

# Create a retrieval-based QA chain
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(search_kwargs={"k": 3})
)

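# Test the RAG system
query = "What is the capital of France?"
# RetrievalQA takes its input under "query" and returns the answer under "result"
result = qa.invoke({"query": query})["result"]
print(result)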
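
With `chain_type="stuff"`, the chain places ("stuffs") all retrieved documents into a single prompt ahead of the question. The sketch below imitates that behavior with plain strings. `build_stuff_prompt` and the prompt wording are illustrative assumptions, not LangChain's actual template.

```python
def build_stuff_prompt(question, retrieved_docs):
    """Concatenate every retrieved document into one prompt, followed by the question."""
    context = "\n\n".join(retrieved_docs)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hand-written stand-ins for the k=3 documents a retriever might return
docs = [
    "Paris is the capital and largest city of France.",
    "France is a country in Western Europe.",
    "The Seine flows through Paris.",
]
prompt = build_stuff_prompt("What is the capital of France?", docs)
print(prompt)
```

Because all `k` documents share one prompt, the combined length of the retrieved chunks has to fit within the model's context window, which is worth checking when chunks are as large as 5,000 characters.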

### Exercise 1.3: Extend the RAG System

Implement a multi-turn conversation system using RAG. This system should maintain context across multiple queries.

In [None]:
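class ConversationalRAG:
    def __init__(self, qa_chain, max_history=3):
        self.qa_chain = qa_chain
        self.max_history = max_history
        self.conversation_history = []

    def ask(self, question):
        # Prepend the most recent exchanges so follow-ups such as "its population" can be resolved.
        # Note that the whole string, history included, is also what the retriever searches with.
        context = "\n".join(self.conversation_history[-self.max_history:])
        full_query = f"Context: {context}\n\nQuestion: {question}" if context else question
        response = self.qa_chain.invoke({"query": full_query})["result"]
        self.conversation_history.append(f"Q: {question}\nA: {response}")
        return response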

# Create an instance of ConversationalRAG
conv_rag = ConversationalRAG(qa)

# Test the conversational RAG system
print(conv_rag.ask("What is the capital of France?"))
print(conv_rag.ask("What is its population?"))
print(conv_rag.ask("Tell me about its famous landmarks."))
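
One way to make the "last three exchanges" window explicit is a bounded queue. The sketch below is only an illustration under stated assumptions: `HistoryBuffer` is a hypothetical helper, and `fake_answer` stands in for the call to the QA chain so the example runs without a model.

```python
from collections import deque

class HistoryBuffer:
    """Keep only the most recent `max_turns` question/answer exchanges."""

    def __init__(self, max_turns=3):
        self.turns = deque(maxlen=max_turns)  # older turns are dropped automatically

    def add(self, question, answer):
        self.turns.append(f"Q: {question}\nA: {answer}")

    def build_query(self, question):
        # On the first turn there is no history, so send the bare question
        if not self.turns:
            return question
        context = "\n".join(self.turns)
        return f"Context: {context}\n\nQuestion: {question}"

def fake_answer(query):
    # Stand-in for the QA chain; a real system would call the chain here
    return f"(answer to: {query.splitlines()[-1]})"

history = HistoryBuffer(max_turns=3)
questions = [
    "What is the capital of France?",
    "What is its population?",
    "Tell me about its famous landmarks.",
    "How old is the city?",
    "Who is its mayor?",
]
for q in questions:
    history.add(q, fake_answer(history.build_query(q)))

print(len(history.turns))  # only the three most recent exchanges are kept
print(history.build_query("Which river runs through it?"))
```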


## 2. Exploring DSPy

### 2.1 Setting up DSPy

Let's set up a basic DSPy environment:

In [None]:
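import dspy
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline

class CustomLLM:
    """Minimal callable wrapper around a Hugging Face text-generation pipeline."""

    def __init__(self, model_name):
        # Note: this loads a second copy of the model (8-bit here, 4-bit in Setup),
        # which may not fit next to the first one on a small GPU.
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map="auto",
            quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        )
        self.pipeline = pipeline(
            "text-generation",
            model=self.model,
            tokenizer=self.tokenizer,
            max_new_tokens=256,
            do_sample=True,  # temperature and top_p only take effect when sampling is enabled
            temperature=0.7,
            top_p=0.95,
            repetition_penalty=1.15,
        )

    def basic_request(self, prompt, **kwargs):
        response = self.pipeline(prompt, **kwargs)[0]['generated_text']
        # The pipeline returns prompt + completion; keep only the completion
        return response[len(prompt):]

    def __call__(self, prompt, **kwargs):
        return self.basic_request(prompt, **kwargs)

# Create an instance of CustomLLM
custom_llm = CustomLLM(model_name)  # or your specific model name

# Store the wrapper in DSPy's global settings so later cells can fetch it via dspy.settings.lm.
# It is a plain callable rather than a dspy.LM, so the cells below call it directly
# instead of going through DSPy's built-in predictors.
dspy.settings.configure(lm=custom_llm)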

### 2.2 Building a Basic DSPy Pipeline

Let's create a simple question-answering pipeline around the language model we registered with DSPy:

In [None]:
import dspy

class SimpleQA:
    def __init__(self, lm):
        self.lm = lm

    def generate_answer(self, question):
        prompt = f"Question: {question}\nAnswer:"
        return self.lm(prompt)

    def __call__(self, question):
        return self.generate_answer(question)

# Assuming you've already set up your language model as 'custom_llm'
# If not, you can create it like this:
# custom_llm = CustomLLM("your-model-name")

# Create an instance of SimpleQA
simple_qa = SimpleQA(dspy.settings.lm)

# Test the QA system
question = "What is machine learning?"
result = simple_qa(question)
print(result)

### Exercise 2.3: Extend the DSPy Pipeline

Implement a fact-checking module in the DSPy pipeline:

In [None]:
import dspy
import re

class RobustFactCheckingQA:
    def __init__(self, lm):
        self.lm = lm

    def generate_answer(self, question):
        prompt = f"Question: {question}\nAnswer:"
        return self.lm(prompt)

    def fact_check(self, question, answer):
        prompt = f"""
        Question: {question}
        Given Answer: {answer}

        Please fact-check the given answer and provide:
        1. Factual Accuracy (as a percentage)
        2. Explanation of your fact-check

        Your response should follow this format:
        Factual Accuracy: [percentage]
        Explanation: [your explanation]
        """
        return self.lm(prompt)

    def parse_fact_check_result(self, result):
        # Extract the percentage that follows "Factual Accuracy"
        accuracy_match = re.search(r'Factual Accuracy:?\s*(\d+(?:\.\d+)?\s*%?)', result, re.IGNORECASE)
        factual_accuracy = accuracy_match.group(1).strip() if accuracy_match else 'N/A'

        # Extract everything after "Explanation:"; fall back to the full response
        # when the model did not follow the requested format
        explanation_match = re.search(r'Explanation:?\s*(.*)', result, re.IGNORECASE | re.DOTALL)
        explanation = explanation_match.group(1).strip() if explanation_match else result.strip()

        return factual_accuracy, explanation

    def __call__(self, question):
        initial_answer = self.generate_answer(question)
        fact_check_result = self.fact_check(question, initial_answer)

        factual_accuracy, explanation = self.parse_fact_check_result(fact_check_result)

        return {
            'answer': initial_answer,
            'factual_accuracy': factual_accuracy,
            'explanation': explanation
        }

# Create an instance of RobustFactCheckingQA
fact_checking_qa = RobustFactCheckingQA(dspy.settings.lm)

# Test the fact-checking QA system
question = "Who invented the telephone?"
result = fact_checking_qa(question)
print(f"Answer: {result['answer']}")
print(f"Factual Accuracy: {result['factual_accuracy']}")
print(f"Explanation: {result['explanation']}")
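
A possible follow-up step is to turn the reported percentage into a number and gate the answer on it. The sketch below assumes result dictionaries shaped like the one returned above. `accuracy_to_float`, `accept_answer`, and the 0.7 threshold are hypothetical choices, and the example dictionaries are hand-written rather than real model output.

```python
import re

def accuracy_to_float(value):
    """Convert strings such as '85%' or '85' to 0.85; return None if no number is found."""
    match = re.search(r"\d+(?:\.\d+)?", value)
    if match is None:
        return None
    return min(float(match.group()) / 100.0, 1.0)

def accept_answer(result, threshold=0.7):
    """Accept the answer only when the self-reported accuracy meets the threshold."""
    score = accuracy_to_float(result["factual_accuracy"])
    return score is not None and score >= threshold

# Hand-written examples in the shape returned by the fact-checking class
examples = [
    {"answer": "Alexander Graham Bell", "factual_accuracy": "90%", "explanation": "placeholder"},
    {"answer": "Thomas Edison", "factual_accuracy": "20%", "explanation": "placeholder"},
    {"answer": "Unknown", "factual_accuracy": "N/A", "explanation": "placeholder"},
]
for example in examples:
    print(example["answer"], accept_answer(example))
```

The percentage is the model grading its own output, so it is better read as a rough signal than as a calibrated probability.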

## 3. Vector Databases and Efficient Retrieval

### 3.1 Creating a Vector Database

We'll use the SentenceTransformer model to create embeddings and FAISS to store them:

In [None]:
# Load a pre-trained SentenceTransformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Create some sample sentences
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is a subset of artificial intelligence.",
    "Python is a popular programming language for data science.",
    "Natural language processing deals with the interaction between computers and human language.",
    "Deep learning models are based on artificial neural networks with multiple layers."
]

# Create embeddings
embeddings = model.encode(sentences)

# Create a FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)
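
`IndexFlatL2` performs an exact, brute-force search: it compares the query against every stored vector and ranks the results by squared Euclidean distance. The pure-Python sketch below mirrors that idea on hand-made 3-dimensional vectors. `squared_l2` and `flat_l2_search` are hypothetical helpers, not part of FAISS.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(vectors, query, k):
    """Score every stored vector and return (distances, indices) of the k nearest."""
    scored = sorted((squared_l2(vec, query), idx) for idx, vec in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Toy "embeddings": vectors 0 and 2 point in almost the same direction
vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
distances, indices = flat_l2_search(vectors, query=[1.0, 0.0, 0.0], k=2)
print(indices, distances)
```

The cost of a flat search grows linearly with the number of stored vectors, which is fine for five sentences. For much larger collections, FAISS also provides approximate index types that trade some accuracy for speed.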


### 3.2 Performing Similarity Search

Now, let's perform a similarity search:

In [None]:
def similarity_search(query, index, model, sentences, k=2):
    # Create a query embedding
    query_embedding = model.encode([query])

    # Perform the search
    distances, indices = index.search(query_embedding, k)

    print(f"Query: {query}")
    print("Most similar sentences:")
    for i, idx in enumerate(indices[0]):
        print(f"{i+1}. {sentences[idx]} (Distance: {distances[0][i]:.4f})")

# Test the similarity search
similarity_search("AI and its applications", index, model, sentences)
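
The distances reported by `IndexFlatL2` are squared L2 distances, so a smaller value means a closer match. If the embeddings are unit-normalized (an assumption worth checking for the embedding model in use), ranking by squared L2 distance is equivalent to ranking by cosine similarity, because for a query vector $q$ and a document vector $d$:

```latex
\|q - d\|^2 = \|q\|^2 + \|d\|^2 - 2\, q \cdot d = 2 - 2\cos(q, d)
\qquad \text{when } \|q\| = \|d\| = 1
```

Under that assumption, a distance of 0 corresponds to identical direction and a distance of 2 to orthogonal vectors.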

### 3.3 Chunking Strategies

Let's implement a more advanced chunking strategy that considers sentence boundaries:

In [None]:
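import nltk
nltk.download('punkt')      # sentence tokenizer data for older NLTK releases
nltk.download('punkt_tab')  # sentence tokenizer data for newer NLTK releases
from nltk.tokenize import sent_tokenize

def sentence_aware_chunks(text, max_chunk_size=200, overlap=20):
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = ""

    for sentence in sentences:
        if not current_chunk:
            # Start a new chunk; a sentence longer than max_chunk_size is kept whole
            current_chunk = sentence
        elif len(current_chunk) + 1 + len(sentence) <= max_chunk_size:
            current_chunk += " " + sentence
        else:
            chunks.append(current_chunk.strip())
            # Carry the last `overlap` characters of the finished chunk into the next one
            tail = current_chunk[-overlap:] if overlap > 0 else ""
            current_chunk = (tail + " " + sentence).strip()

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks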

# Test the chunking strategy
long_text = " ".join(sentences)
chunks = sentence_aware_chunks(long_text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}")
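
An alternative to carrying over raw characters is to overlap by whole sentences, so that every chunk begins on a sentence boundary. The sketch below is a simplified stand-in: it splits sentences with a regular expression instead of NLTK, and `chunk_by_sentences` and its parameters are hypothetical names.

```python
import re

def chunk_by_sentences(text, max_chars=130, overlap_sentences=1):
    """Greedily pack sentences into chunks, repeating the last few sentences of
    each chunk at the start of the next one."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []

    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Start the next chunk with the tail of the previous one
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
            # Drop the overlap if it would push the new chunk over the limit
            if len(" ".join(current + [sentence])) > max_chars:
                current = []
        current.append(sentence)

    if current:
        chunks.append(" ".join(current))
    return chunks

text = (
    "The quick brown fox jumps over the lazy dog. "
    "Machine learning is a subset of artificial intelligence. "
    "Python is a popular programming language for data science. "
    "Deep learning models use neural networks with multiple layers."
)
result = chunk_by_sentences(text)
for i, chunk in enumerate(result, start=1):
    print(f"Chunk {i} ({len(chunk)} chars): {chunk}")
```

A single sentence longer than `max_chars` is still emitted as its own oversized chunk, since this sketch never splits inside a sentence.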

### Exercise 3.4: Compare Chunking Strategies

Implement a fixed-length chunking strategy and compare its performance with the sentence-aware approach:

In [None]:
def fixed_length_chunks(text, chunk_size=200):
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

# Compare the two chunking strategies
print("Sentence-aware chunks:")
sentence_chunks = sentence_aware_chunks(long_text)
for i, chunk in enumerate(sentence_chunks):
    print(f"Chunk {i+1} (length {len(chunk)}): {chunk[:50]}...")

print("\nFixed-length chunks:")
fixed_chunks = fixed_length_chunks(long_text)
for i, chunk in enumerate(fixed_chunks):
    print(f"Chunk {i+1} (length {len(chunk)}): {chunk[:50]}...")

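# Evaluate coherence with a rough proxy: the fraction of chunks that end on
# sentence-final punctuation (a more sophisticated metric may be needed in practice)
def simple_coherence_score(chunks):
    if not chunks:
        return 0.0
    return sum(1 for chunk in chunks if chunk.endswith(('.', '!', '?'))) / len(chunks)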

print(f"\nSentence-aware coherence: {simple_coherence_score(sentence_chunks):.2f}")
print(f"Fixed-length coherence: {simple_coherence_score(fixed_chunks):.2f}")
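
Fixed-length chunking is often paired with a sliding window, so that text cut at one boundary still appears intact in the neighboring chunk. Here is a minimal sketch, with `sliding_window_chunks` as a hypothetical helper and a short alphabet string as sample input:

```python
def sliding_window_chunks(text, chunk_size=200, overlap=20):
    """Cut text into fixed-size character windows that overlap by `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "abcdefghijklmnopqrstuvwxyz"
windows = sliding_window_chunks(sample, chunk_size=10, overlap=3)
for window in windows:
    print(window)
```

Each window repeats the last three characters of the previous one. The final window may be shorter than `chunk_size`.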

## 4. Integrating RAG, DSPy, and Vector Databases

Now, let's bring everything together by creating an advanced RAG system that uses DSPy and our custom vector database.

In [None]:
import numpy as np

class AdvancedRAG:
    def __init__(self, faiss_index, sentences, model, lm):
        """
        Initialize the AdvancedRAG system.

        :param faiss_index: A pre-built FAISS index for efficient similarity search
        :param sentences: A list of sentences or documents that correspond to the FAISS index
        :param model: A sentence transformer model for encoding queries
        :param lm: A language model for generating answers (e.g., GPT-2 or Llama)
        """
        self.faiss_index = faiss_index
        self.sentences = sentences
        self.model = model  # Sentence transformer model
        self.lm = lm  # Language model for answer generation

    def retrieve(self, query, k=2):
        """
        Retrieve the most relevant contexts for a given query.

        :param query: The input question or query
        :param k: The number of contexts to retrieve (default is 2)
        :return: A list of the k most relevant sentences/documents
        """
        # Encode the query using the sentence transformer model
        query_embedding = self.model.encode([query])

        # Perform a similarity search in the FAISS index
        # This returns the distances and indices of the k nearest neighbors
        distances, indices = self.faiss_index.search(query_embedding, k)

        # Return the actual sentences/documents corresponding to the found indices
        return [self.sentences[idx] for idx in indices[0]]

    def generate_answer(self, question, context):
        """
        Generate an answer to the question based on the retrieved context.

        :param question: The input question
        :param context: The retrieved relevant contexts
        :return: The generated answer
        """
        # Construct a prompt for the language model
        prompt = f"""
        Context: {' '.join(context)}

        Question: {question}

        Please provide a concise answer to the question based on the given context.

        Answer:
        """
        # Use the language model to generate an answer based on the prompt
        return self.lm(prompt)

    def __call__(self, question):
        """
        Make the class callable. This method orchestrates the RAG process.

        :param question: The input question
        :return: A dictionary containing the question, retrieved context, and generated answer
        """
        # First, retrieve relevant contexts
        context = self.retrieve(question)

        # Then, generate an answer based on the question and retrieved contexts
        answer = self.generate_answer(question, context)

        # Return all information in a dictionary
        return {'question': question, 'context': context, 'answer': answer}

# Usage example:
# Note: You need to have these components set up before using this class:
# - index: Your FAISS index
# - sentences: Your list of sentences or documents
# - model: Your sentence transformer model
# - dspy.settings.lm: Your language model (GPT-2, Llama, etc.) configured in DSPy

# Create an instance of the AdvancedRAG class
advanced_rag = AdvancedRAG(index, sentences, model, dspy.settings.lm)

# Test the advanced RAG system with a sample question
question = "What are the main areas of AI?"
result = advanced_rag(question)

# Print the results
print(f"Question: {result['question']}")
print(f"Context: {result['context']}")
print(f"Answer: {result['answer']}")

### Exercise 4.1: Implement Multi-Stage Retrieval

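Implement a multi-stage retrieval process: retrieve a larger candidate set, rerank it with the language model, and answer from the top passages. The solution below rebuilds the pipeline from `dspy.Module` classes with a small FLAN-T5 model rather than subclassing `AdvancedRAG`: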

In [None]:
import torch  # needed for the torch.cuda.is_available() check below
import dspy
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Custom HuggingFace language model for DSPy
class HFLanguageModel:
    def __init__(self, model_name="google/flan-t5-base"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = pipeline("text2text-generation", model=model_name, device="cuda" if torch.cuda.is_available() else "cpu")

    def generate(self, prompt, **kwargs):
        response = self.model(prompt, max_length=100, **kwargs)[0]['generated_text']
        return response

# Custom retriever using HuggingFace embeddings and FAISS
class HFRetriever:
    def __init__(self, sentences, model_name="sentence-transformers/all-MiniLM-L6-v2"):
        self.sentences = sentences
        self.model = SentenceTransformer(model_name)
        self.index = self._create_index()

    def _create_index(self):
        embeddings = self.model.encode(self.sentences)
        index = faiss.IndexFlatL2(embeddings.shape[1])
        index.add(embeddings)
        return index

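    def retrieve(self, query, k=3):
        # FAISS pads the result with -1 when k exceeds the number of indexed vectors,
        # so cap k and drop any padding indices
        k = min(k, len(self.sentences))
        query_embedding = self.model.encode([query])
        _, indices = self.index.search(query_embedding, k)
        return [self.sentences[idx] for idx in indices[0] if idx >= 0]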

# Set up our language model and retriever
lm = HFLanguageModel()
sentences = ["AI is a broad field of computer science.", "Machine learning is a subset of AI.", "Deep learning uses neural networks with multiple layers."]
retriever = HFRetriever(sentences)

class RAG(dspy.Module):
    def __init__(self, retriever, lm):
        super().__init__()
        self.retriever = retriever
        self.lm = lm

    def forward(self, question):
        context = self.retriever.retrieve(question)
        prompt = f"Context: {' '.join(context)}\n\nQuestion: {question}\n\nAnswer:"
        answer = self.lm.generate(prompt)
        return answer

# Create our RAG instance
rag = RAG(retriever, lm)

# Use the RAG system
question = "What is the relationship between machine learning and AI?"
answer = rag(question)
print(f"Question: {question}")
print(f"Answer: {answer}")

class MultiStageRAG(dspy.Module):
    def __init__(self, retriever, lm):
        super().__init__()
        self.retriever = retriever
        self.lm = lm

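    def rerank(self, passages, question, top_n=2):
        # Number the passages so the model can point at them unambiguously
        numbered = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(passages))
        prompt = (
            f"Question: {question}\n\nPassages:\n{numbered}\n\n"
            f"List the numbers of the {top_n} passages most relevant to the question, most relevant first:"
        )
        reply = self.lm.generate(prompt)
        # Map the numbers in the reply back to the original passages, ignoring anything out of range
        picked = [int(tok) - 1 for tok in reply.replace(",", " ").replace(".", " ").split() if tok.isdigit()]
        picked = [i for i in dict.fromkeys(picked) if 0 <= i < len(passages)]
        # Fall back to the retriever's order when the reply cannot be parsed
        return [passages[i] for i in picked[:top_n]] or passages[:top_n]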

    def forward(self, question):
        initial_context = self.retriever.retrieve(question, k=5)
        reranked_context = self.rerank(initial_context, question)
        prompt = f"Context: {' '.join(reranked_context)}\n\nQuestion: {question}\n\nAnswer:"
        answer = self.lm.generate(prompt)
        return answer

# Create our multi-stage RAG instance
multi_stage_rag = MultiStageRAG(retriever, lm)

# Use the multi-stage RAG system
question = "What are the main areas of AI?"
answer = multi_stage_rag(question)
print(f"Question: {question}")
print(f"Answer: {answer}")
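
The retrieve-then-rerank pattern can also be illustrated without any models. In the sketch below, both scoring functions are deliberately simple stand-ins (shared-word count for the first stage, Jaccard similarity for the second) and every name is hypothetical. In practice the second stage is usually a stronger and slower scorer, such as a cross-encoder or an LLM prompt.

```python
import re

def tokens(text):
    """Lowercased set of alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def first_stage(query, passages, k=3):
    """Cheap, recall-oriented stage: rank by the number of shared words."""
    q = tokens(query)
    ranked = sorted(passages, key=lambda p: len(q & tokens(p)), reverse=True)
    return ranked[:k]

def second_stage(query, candidates, top_n=2):
    """More selective stage: rank by Jaccard similarity, which also penalizes
    passages that contain many words unrelated to the query."""
    q = tokens(query)

    def jaccard(passage):
        t = tokens(passage)
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    return sorted(candidates, key=jaccard, reverse=True)[:top_n]

passages = [
    "AI is a broad field of computer science.",
    "Machine learning is a subset of AI.",
    "Deep learning uses neural networks with multiple layers.",
    "Paris is the capital of France.",
]
query = "How is machine learning related to AI?"
candidates = first_stage(query, passages, k=3)   # wide net
final = second_stage(query, candidates, top_n=2)  # narrow down
print(final)
```

The first stage keeps the candidate set small enough that the more expensive second stage only has to score a handful of passages.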