# 🧠 RAG (Retrieval-Augmented Generation) Concepts - Greek Derby Chatbot

## Learning Objectives
By the end of this lesson, you will understand:
- What RAG is and why it's revolutionary for chatbots
- How vector databases work for semantic search
- LangChain framework and its components
- LangGraph for complex AI workflows
- Embeddings and their role in similarity search
- Memory management in conversational AI
- Web scraping and data processing for knowledge bases

---

## Q1: What is RAG and why is it better than traditional chatbots?

**Answer:**

RAG (Retrieval-Augmented Generation) is a technique that combines **retrieval** of relevant information with **generation** of responses. Before answering, the model is handed passages pulled from an external knowledge base, much like an assistant who consults a library before replying.

### Traditional Chatbots vs RAG:

**Traditional Chatbots:**
- Rely only on knowledge learned during training
- Can't access real-time or domain-specific information
- Often give generic or outdated answers

**RAG Chatbots:**
- Can access external knowledge sources (websites, documents, databases)
- Provide up-to-date, specific information
- Give contextually relevant answers
- Can be updated with new information without retraining

### How RAG Works:

1. **Retrieval**: Find relevant information from a knowledge base
2. **Augmentation**: Combine retrieved information with the user's question
3. **Generation**: Use an LLM to generate a response based on both

### Our Greek Derby RAG System:

```python
# Simplified RAG flow
def chat(self, question: str) -> str:
    # 1. RETRIEVAL: Find relevant documents
    relevant_docs = self.vector_store.similarity_search(question, k=4)
    
    # 2. AUGMENTATION: Combine question with context
    context = self._format_context(relevant_docs)
    augmented_prompt = f"Context: {context}\nQuestion: {question}"
    
    # 3. GENERATION: Generate response using LLM
    response = self.llm.invoke(augmented_prompt)
    return response.content  # invoke() returns a message object; .content is the text
```

### Why RAG is Perfect for Our Greek Derby Chatbot:
- **Real-time Information**: Can access latest news from Gazzetta.gr
- **Specific Knowledge**: Knows about Olympiakos vs Panathinaikos specifically
- **Accurate Facts**: Retrieves actual information rather than generating from memory
- **Updatable**: Can add new information without retraining the model
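
To try the retrieve, augment, generate flow without API keys, here is a minimal sketch. Everything in it is an assumption made for illustration: the three knowledge-base sentences, the word-overlap scoring (a stand-in for embedding similarity), and `fake_llm` (a stand-in for a real model call).

```python
import re

# Tiny in-memory "knowledge base"; the sentences are illustrative placeholders.
KNOWLEDGE_BASE = [
    "The derby is played between Olympiakos and Panathinaikos.",
    "Olympiakos is based in Piraeus.",
    "Panathinaikos is based in Athens.",
]

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question: str, k: int = 2) -> list[str]:
    """RETRIEVAL: rank documents by the number of words shared with the question."""
    q_words = tokenize(question)
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(q_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]

def augment(question: str, docs: list[str]) -> str:
    """AUGMENTATION: put the retrieved passages in front of the question."""
    context = "\n".join(f"- {doc}" for doc in docs)
    return f"Context:\n{context}\nQuestion: {question}"

def fake_llm(prompt: str) -> str:
    """GENERATION stand-in: a real system would send the prompt to an LLM."""
    return f"[LLM would answer from this prompt]\n{prompt}"

question = "Where is Olympiakos based?"
print(fake_llm(augment(question, retrieve(question))))
```

Swapping `retrieve` for a vector search and `fake_llm` for a model call gives the same shape as the `chat` method above.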


## Q2: What are embeddings and how do they enable semantic search?

**Answer:**

Embeddings are numerical representations of text that capture semantic meaning. They convert words, sentences, or documents into high-dimensional vectors that can be compared mathematically.

### What are Embeddings?

Think of embeddings as a "translation" of text into numbers that computers can understand and compare. Similar texts get similar numbers, while different texts get different numbers.

### How Embeddings Work:

1. **Text Input**: "Olympiakos vs Panathinaikos derby"
2. **Embedding Model**: Converts to vector like `[0.1, -0.3, 0.8, ...]`
3. **Mathematical Comparison**: Can calculate similarity between vectors

### Our Embedding Setup:

```python
def _init_embeddings(self):
    """Initialize embeddings model"""
    self.embeddings = OpenAIEmbeddings(
        model="text-embedding-3-small",
        dimensions=1024
    )
```

### Key Parameters:
- **Model**: `text-embedding-3-small` - OpenAI's efficient embedding model
- **Dimensions**: `1024` - Each text becomes a 1024-dimensional vector (the model's default is 1536; the `dimensions` parameter shortens it)
- **Cost**: Small model = lower cost, good performance

### Semantic Search Example:

```python
# These would have similar embeddings:
"Olympiakos football team"
"Ολυμπιακός ποδοσφαιρική ομάδα"  # Greek for "Olympiakos football team"
"Red and white team from Piraeus"

# These would have different embeddings:
"Olympiakos football team"
"Panathinaikos basketball team"
"Cooking recipes"
```

### Why 1024 Dimensions?

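- **More Dimensions**: Capture finer semantic distinctions, with diminishing returns
- **Computational Cost**: Every extra dimension adds work to each similarity comparison, so 1024 trades a little accuracy for speed compared with the model's 1536 default
- **Storage**: Each chunk needs 1024 numbers stored
- **Compatibility**: The value must match the dimension configured on the Pinecone index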
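
The storage trade-off can be made concrete with a little arithmetic. The sketch below assumes each value is stored as a 4-byte float32 and counts only the raw vectors; metadata and index overhead, which a real vector database adds, are ignored. The figure of 10,000 chunks is an arbitrary example.

```python
def index_size_bytes(num_chunks: int, dimensions: int = 1024, bytes_per_value: int = 4) -> int:
    """Raw vector storage: one value per dimension per chunk (float32 assumed)."""
    return num_chunks * dimensions * bytes_per_value

# Compare a few dimension settings for a hypothetical 10,000-chunk knowledge base
for dims in (256, 1024, 1536):
    mib = index_size_bytes(10_000, dims) / (1024 ** 2)
    print(f"{dims} dims x 10,000 chunks = {mib:.1f} MiB")
```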

### Embedding Process in Our System:

1. **Document Chunking**: Split large documents into smaller pieces
2. **Embedding Generation**: Convert each chunk to a 1024-dimensional vector
3. **Vector Storage**: Store in Pinecone vector database
4. **Query Embedding**: Convert user question to vector
5. **Similarity Search**: Find most similar document vectors

### Text Chunking Strategy - Why Size Matters:

Our system uses `RecursiveCharacterTextSplitter` with specific parameters:

```python
self.text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,        # Characters per chunk
    chunk_overlap=100,     # Overlap between chunks
    separators=["\n\n", "\n", ". ", "! ", "? ", " ", ""]
)
```

### Chunk Size Guidelines:

**500 Characters (Our Choice):**
- **Pros**: 
  - Fits in most LLM context windows
  - Contains enough context for meaningful answers
  - Good balance between specificity and context
  - Keeps prompts small even when several chunks are retrieved
- **Cons**: 
  - May split related information across chunks
  - Might miss broader context

**1000 Characters:**
- **Pros**: 
  - More context per chunk
  - Better for complex questions
  - Reduces number of chunks needed
- **Cons**: 
  - More prompt tokens per retrieved chunk (higher LLM cost)
  - May include irrelevant information
  - Harder to find specific details

**250 Characters:**
- **Pros**: 
  - Very specific and focused
  - Fewer prompt tokens per retrieved chunk
  - Fast retrieval
- **Cons**: 
  - May lack sufficient context
  - More chunks to process
  - Risk of losing meaning

### Basic Rules for Chunk Size:

1. **Content Type**:
   - **News Articles**: 500-800 characters
   - **Technical Docs**: 300-500 characters
   - **Conversational Text**: 400-600 characters
   - **Code**: 200-400 characters

2. **LLM Context Window**:
   - **GPT-4**: Can handle larger chunks (1000+)
   - **GPT-3.5**: Better with smaller chunks (500-800)
   - **Local Models**: Often need smaller chunks (300-500)

3. **Question Complexity**:
   - **Simple Facts**: Smaller chunks (300-500)
   - **Complex Analysis**: Larger chunks (600-1000)
   - **Comparative Questions**: Medium chunks (500-700)

4. **Cost Considerations**:
   - **More Chunks** = More vectors to store and search, plus more duplicated text from overlap to embed
   - **Larger Chunks** = Higher LLM processing costs
   - **Balance**: Find the sweet spot for your use case
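
To see how chunk size and overlap drive the number of chunks, here is a rough estimate. It assumes a plain fixed-size sliding window; `RecursiveCharacterTextSplitter` splits on separators, so real counts will differ somewhat. The 10,000-character article length is an arbitrary example.

```python
import math

def estimate_chunks(text_length: int, chunk_size: int, overlap: int) -> int:
    """Estimate the chunk count for a fixed-size sliding window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1
    step = chunk_size - overlap  # how far the window advances each time
    return 1 + math.ceil((text_length - chunk_size) / step)

for size, overlap in [(250, 50), (500, 0), (500, 100), (1000, 200)]:
    n = estimate_chunks(10_000, size, overlap)
    print(f"chunk_size={size}, overlap={overlap}: about {n} chunks")
```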

### Our Greek Derby Chatbot Reasoning:

We chose **500 characters** because:
- **Greek Text**: Greek usually takes more tokens per character than English with OpenAI tokenizers, so moderate chunks keep token counts in check
- **Sports Content**: News articles have natural paragraph breaks
- **Question Types**: Most questions need 1-2 sentences of context
- **Cost Efficiency**: Balance between quality and cost
- **Retrieval Accuracy**: Small enough to be specific, large enough for context

### Chunk Overlap Strategy:

**100 Character Overlap** ensures:
- **Context Preservation**: Important information isn't lost at boundaries
- **Better Retrieval**: Related information spans multiple chunks
- **Smoother Processing**: No gaps in information flow
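
The effect of overlap is easiest to see on a tiny string. The following is a simplified fixed-window chunker written for illustration only; it is not the algorithm `RecursiveCharacterTextSplitter` uses, which also tries to break on separators.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this window reached the end
            break
    return chunks

# Each chunk repeats the last 2 characters of the previous one
print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
```

A sentence that straddles a boundary appears in full in at least one chunk as long as it is shorter than the overlap.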

### Testing Your Chunk Size:

```python
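# Compare how different chunk sizes split the same text
from langchain_text_splitters import RecursiveCharacterTextSplitter

def test_chunk_sizes(text, sizes=(250, 500, 750, 1000)):  # tuple avoids a mutable default
    for size in sizes:
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=size,
            chunk_overlap=size // 5  # 20% overlap
        )
        chunks = splitter.split_text(text)
        if not chunks:  # Empty input yields no chunks; avoid dividing by zero
            print(f"Size {size}: 0 chunks")
            continue
        print(f"Size {size}: {len(chunks)} chunks")
        print(f"Average chunk length: {sum(len(c) for c in chunks)/len(chunks):.0f}")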
        print("---")
```

### Best Practices:

1. **Start with 500**: Good default for most use cases
2. **Test with Your Data**: Try different sizes with your specific content
3. **Monitor Performance**: Track retrieval quality and costs
4. **Consider Language**: Some languages need different chunk sizes
5. **Iterate**: Adjust based on user feedback and performance metrics


## Q3: How do vector databases work and why do we use Pinecone?

**Answer:**

Vector databases are specialized databases designed to store and search high-dimensional vectors efficiently. They're perfect for RAG systems because they can quickly find similar vectors (documents) based on semantic similarity.

### What is a Vector Database?

A vector database stores:
- **Vectors**: High-dimensional arrays (our 1024-dimensional embeddings)
- **Metadata**: Additional information about each vector
- **Indexes**: Optimized structures for fast similarity search

### Why Not Regular Databases?

**Regular Database (SQL/NoSQL):**
- Stores text as strings
- Uses exact matches or simple text search
- Can't understand semantic meaning
- Slow for similarity searches

**Vector Database:**
- Stores text as numerical vectors
- Understands semantic similarity
- Optimized for similarity search
- Fast even with millions of vectors

### Our Pinecone Setup:

```python
import os

from pinecone import Pinecone
from langchain_pinecone import PineconeVectorStore

def _init_vector_store(self):
    """Initialize vector store"""
    pc = Pinecone(api_key=os.getenv('PINECONE_API_KEY'))
    self.index = pc.Index(os.getenv('PINECONE_GREEK_DERBY_INDEX_NAME'))
    self.vector_store = PineconeVectorStore(embedding=self.embeddings, index=self.index)
```

### Pinecone Configuration:

```python
# Index configuration
{
    "name": "greek-derby-index",
    "dimension": 1024,        # Must match embedding dimensions
    "metric": "cosine",       # Similarity metric
    "pods": 1,               # Number of pods (scaling)
    "replicas": 1            # Number of replicas (reliability)
}
```

### Why Pinecone?

1. **Performance**: Optimized for vector similarity search
2. **Scalability**: Can handle millions of vectors
3. **Managed Service**: No infrastructure management needed
4. **LangChain Integration**: Easy to use with our framework
5. **Real-time**: Fast query responses
6. **Reliability**: Built-in redundancy and backup

### How Vector Search Works:

```python
# 1. User asks: "What is the history of the derby?"
query = "What is the history of the derby?"

# 2. Convert question to embedding
query_vector = embeddings.embed_query(query)
# Result: [0.1, -0.3, 0.8, 0.2, ...] (1024 dimensions)

# 3. Search for similar vectors
similar_docs = vector_store.similarity_search(query, k=4)
# Returns 4 most similar document chunks

# 4. Use retrieved documents as context (each result is a Document with page_content)
context = "\n\n".join(doc.page_content for doc in similar_docs)
```

### Similarity Metrics:

**Cosine Similarity** (what we use):
- Measures angle between vectors
- Range: -1 to 1 (1 = same direction, 0 = orthogonal, -1 = opposite)
- Good for text similarity
- Scale-invariant (ignores vector magnitude)

**Other Metrics:**
- **Euclidean Distance**: Straight-line distance between the two points
- **Dot Product**: Sum of element-wise products; unlike cosine, it grows with vector magnitude
- **Manhattan Distance**: Sum of absolute differences
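
The four metrics can be compared side by side in plain Python. The two vectors are made-up 3-dimensional examples (real embeddings have 1024 values); `b` points in the same direction as `a` but is twice as long, which shows why cosine similarity is scale-invariant while the others are not.

```python
import math

def dot_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))

def euclidean_distance(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a: list[float], b: list[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the magnitude

print(f"cosine:    {cosine_similarity(a, b):.3f}")   # direction only
print(f"dot:       {dot_product(a, b):.3f}")         # grows with magnitude
print(f"euclidean: {euclidean_distance(a, b):.3f}")
print(f"manhattan: {manhattan_distance(a, b):.3f}")
```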

### Storage and Retrieval Process:

1. **Indexing**: Store document chunks as vectors
2. **Querying**: Convert question to vector
3. **Searching**: Find most similar vectors
4. **Ranking**: Sort by similarity score
5. **Retrieval**: Return top-k most relevant chunks
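
The five steps above can be sketched as a toy in-memory store. This is an illustration, not how Pinecone is implemented: it compares the query against every stored vector (brute force), whereas production vector databases rely on approximate nearest-neighbor indexes. `TinyVectorStore` and the hand-written 3-dimensional vectors are hypothetical stand-ins for real embeddings.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class TinyVectorStore:
    def __init__(self, dimension: int):
        self.dimension = dimension
        self._items: list[tuple[list[float], str]] = []

    def add(self, vector: list[float], text: str) -> None:
        """Indexing: store a chunk's vector next to its text."""
        if len(vector) != self.dimension:
            raise ValueError(f"expected {self.dimension} dimensions, got {len(vector)}")
        self._items.append((vector, text))

    def search(self, query_vector: list[float], k: int = 4) -> list[tuple[float, str]]:
        """Searching + ranking: score every vector, return the top-k."""
        scored = [(cosine(query_vector, vec), text) for vec, text in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]

store = TinyVectorStore(dimension=3)
store.add([1.0, 0.0, 0.0], "derby history")
store.add([0.0, 1.0, 0.0], "ticket prices")
store.add([0.9, 0.1, 0.0], "club founding")

for score, text in store.search([1.0, 0.0, 0.0], k=2):
    print(f"{score:.3f}  {text}")
```

The dimension check in `add` mirrors the rule that the index dimension must match the embedding dimension.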


## Q4: What is LangChain and how does it help us build RAG systems?

**Answer:**

LangChain is a powerful framework for building applications with Large Language Models (LLMs). It provides a unified interface for working with different LLM providers and makes it easy to build complex AI applications like our RAG chatbot.

### What is LangChain?

LangChain is like a "Swiss Army knife" for AI applications. It provides:
- **Abstractions**: Common patterns for AI applications
- **Components**: Reusable building blocks
- **Chains**: Ways to combine components
- **Memory**: Conversation state management
- **Tools**: Integration with external systems

### Key LangChain Concepts:

1. **LLMs**: Language models (GPT, Claude, etc.)
2. **Prompts**: Templates for structuring inputs
3. **Chains**: Sequences of operations
4. **Memory**: Storing conversation history
5. **Agents**: AI that can use tools
6. **Vector Stores**: Databases for embeddings

### Our LangChain Usage:

```python
# 1. Language Model
from langchain.chat_models import init_chat_model
self.llm = init_chat_model("gpt-4o-mini", model_provider="openai")

# 2. Embeddings
from langchain_openai import OpenAIEmbeddings
self.embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    dimensions=1024  # Must match the 1024-dimension Pinecone index
)

# 3. Vector Store
from langchain_pinecone import PineconeVectorStore
self.vector_store = PineconeVectorStore(embedding=self.embeddings, index=self.index)

# 4. Memory
from langchain.memory import ConversationBufferMemory
self.memory = ConversationBufferMemory(return_messages=True)

# 5. Text Splitter
from langchain_text_splitters import RecursiveCharacterTextSplitter
self.text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100
)
```

### Why Use LangChain?

**Without LangChain:**
```python
# Manual implementation - lots of boilerplate
import os
from openai import OpenAI
from pinecone import Pinecone

# Separate setup for each component
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

# Manual prompt construction (context and question come from your own retrieval code)
prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}]
)
answer = response.choices[0].message.content
```

**With LangChain:**
```python
# Clean, simple, and reusable
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=self.llm,
    chain_type="stuff",
    retriever=self.vector_store.as_retriever()
)
response = qa_chain.invoke({"query": question})["result"]
```

### LangChain Components We Use:

1. **LLM Wrapper**: `init_chat_model()` - Unified interface for different LLMs
2. **Embeddings**: `OpenAIEmbeddings` - Convert text to vectors
3. **Vector Store**: `PineconeVectorStore` - Store and search vectors
4. **Memory**: `ConversationBufferMemory` - Remember conversation history
5. **Text Splitter**: `RecursiveCharacterTextSplitter` - Split documents into chunks
6. **Prompts**: `ChatPromptTemplate` - Structure prompts consistently

### LangChain Benefits:

- **Abstraction**: Hide complexity of different LLM providers
- **Composability**: Mix and match components easily
- **Standardization**: Common patterns across AI applications
- **Memory Management**: Built-in conversation memory
- **Error Handling**: Robust error handling and retries
- **Extensibility**: Easy to add custom components

### LangChain Chains:

Chains combine multiple components into workflows:

```python
# RetrievalQA Chain (what we could use)
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=self.llm,
    chain_type="stuff",
    retriever=self.vector_store.as_retriever(),
    return_source_documents=True
)

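# Calling qa_chain.invoke({"query": question}) automatically:
# 1. Retrieves relevant documents
# 2. Formats them as context
# 3. Sends to LLM with question
# 4. Returns a dict with the answer under "result" and, because of
#    return_source_documents=True, the retrieved chunks under "source_documents"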
```
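
The `chain_type="stuff"` setting means all retrieved documents are "stuffed" into a single prompt. A rough sketch of that idea follows; the function name and the prompt wording are assumptions for illustration, not LangChain's actual template.

```python
def stuff_documents(question: str, documents: list[str]) -> str:
    """Concatenate every retrieved document into one prompt."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

print(stuff_documents(
    "Who plays in the derby?",
    ["Olympiakos is based in Piraeus.", "Panathinaikos is based in Athens."],
))
```

This is simple and needs only one LLM call, but the prompt grows with every retrieved chunk, which is one reason to keep `k` and the chunk size moderate.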


## Q5: What is LangGraph and how do we use it for complex AI workflows?

**Answer:**

LangGraph is a library for building stateful, multi-actor applications with LLMs. It's like a "workflow engine" for AI applications, allowing you to create complex, branching logic that can handle different scenarios and maintain state across multiple steps.

### What is LangGraph?

LangGraph extends LangChain by adding:
- **State Management**: Persistent state across workflow steps
- **Conditional Logic**: Different paths based on conditions
- **Human-in-the-Loop**: Pause for human input when needed
- **Cycles and Loops**: Repeat steps until conditions are met
- **Multi-Agent Systems**: Multiple AI agents working together

### Our LangGraph Implementation:

```python
from typing import List

from langchain_core.documents import Document
from langgraph.graph import StateGraph, START, END
from typing_extensions import TypedDict

class GreekDerbyState(TypedDict):
    question: str
    context: List[Document]
    answer: str

def _init_rag_system(self):
    """Initialize RAG system with LangGraph"""
    # Create the workflow graph
    workflow = StateGraph(GreekDerbyState)
    
    # Add nodes (workflow steps)
    workflow.add_node("retrieve", self._retrieve_documents)
    workflow.add_node("generate", self._generate_answer)
    
    # Define the flow
    workflow.add_edge(START, "retrieve")
    workflow.add_edge("retrieve", "generate")
    workflow.add_edge("generate", END)
    
    # Compile the graph
    self.rag_chain = workflow.compile()
```

### Why Use LangGraph Instead of Simple Chains?

**Simple Chain Approach:**
```python
# Linear, no branching
def simple_rag(question):
    docs = retrieve(question)
    answer = generate(question, docs)
    return answer
```

**LangGraph Approach:**
```python
# Branching logic a graph can express (plain-Python sketch; the helper functions are placeholders)
def complex_rag(question):
    docs = retrieve(question)
    
    if len(docs) < 2:  # Not enough context
        docs = search_more_sources(question)
    
    if is_ambiguous(question):  # Need clarification
        return ask_for_clarification(question)
    
    answer = generate(question, docs)
    
    if needs_follow_up(answer):  # Generate follow-up questions
        follow_ups = generate_follow_ups(answer)
        return answer, follow_ups
    
    return answer
```

### Our Workflow Steps:

1. **Retrieve Documents** (`retrieve` node):
   ```python
   def _retrieve_documents(self, state: GreekDerbyState):
       """Retrieve relevant documents for the question"""
       question = state["question"]
       docs = self.vector_store.similarity_search(question, k=4)
       return {"context": docs}
   ```

2. **Generate Answer** (`generate` node):
   ```python
   def _generate_answer(self, state: GreekDerbyState):
       """Generate answer using retrieved context"""
       question = state["question"]
       context = state["context"]
       
       # Format context and create prompt
       formatted_context = self._format_context(context)
       prompt = self._create_prompt(question, formatted_context)
       
       # Generate response (invoke returns a message object; the state expects a str)
       response = self.llm.invoke(prompt)
       return {"answer": response.content}
   ```

### State Management:

The `GreekDerbyState` TypedDict defines what data flows through our workflow:

```python
class GreekDerbyState(TypedDict):
    question: str        # User's question
    context: List[Document]  # Retrieved documents
    answer: str         # Generated answer
```

Each node can:
- **Read** from the state
- **Update** the state
- **Pass** data to the next node

### Advanced LangGraph Features We Could Use:

1. **Conditional Edges**:
   ```python
   def should_retrieve_more(state):
       return len(state["context"]) < 2
   
   workflow.add_conditional_edges(
       "retrieve",
       should_retrieve_more,
       {
           True: "retrieve_more",
           False: "generate"
       }
   )
   ```

2. **Human-in-the-Loop**:
   ```python
   def needs_clarification(state):
       return is_ambiguous(state["question"])
   
   workflow.add_node("clarify", ask_user_for_clarification)
   ```

3. **Cycles and Loops**:
   ```python
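   def should_continue(state):
       # "satisfied" would have to be added to the state schema
       return not state.get("satisfied", False)
   
   # Loop back to retrieval until satisfied; an unconditional edge would never end
   workflow.add_conditional_edges(
       "generate",
       should_continue,
       {True: "retrieve", False: END}
   )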
   ```

### Benefits of LangGraph:

- **Flexibility**: Handle complex, non-linear workflows
- **State Persistence**: Maintain context across steps
- **Debugging**: Easy to trace execution flow
- **Scalability**: Can handle complex multi-step processes
- **Reusability**: Workflow components can be reused
- **Testing**: Easy to test individual nodes

### Real-World Example:

Our Greek Derby chatbot could be enhanced with:

```python
# Enhanced workflow
def enhanced_rag_workflow():
    workflow = StateGraph(GreekDerbyState)
    
    # Add more nodes
    workflow.add_node("classify", classify_question_type)
    workflow.add_node("retrieve", retrieve_documents)
    workflow.add_node("validate", validate_context)
    workflow.add_node("generate", generate_answer)
    workflow.add_node("fact_check", fact_check_answer)
    
    # Complex flow
    workflow.add_edge(START, "classify")
    workflow.add_conditional_edges("classify", route_by_type)
    workflow.add_edge("retrieve", "validate")
    workflow.add_conditional_edges("validate", needs_more_context)
    workflow.add_edge("generate", "fact_check")
    workflow.add_edge("fact_check", END)
    
    return workflow.compile()
```


## Q6: How do we manage conversation memory in our RAG chatbot?

**Answer:**

Conversation memory is crucial for creating engaging, context-aware chatbots. Our Greek Derby RAG chatbot uses LangChain's memory system to remember previous interactions and maintain conversation context.

### Why Memory Matters:

**Without Memory:**
```
User: "What is the history of the derby?"
Bot: "The Greek Derby between Olympiakos and Panathinaikos..."

User: "Who has won more times?"
Bot: "I need more context. Which teams are you asking about?"
```

**With Memory:**
```
User: "What is the history of the derby?"
Bot: "The Greek Derby between Olympiakos and Panathinaikos..."

User: "Who has won more times?"
Bot: "Based on our previous discussion about the Greek Derby, Olympiakos has won more times..."
```

### Our Memory Implementation:

```python
from langchain.memory import ConversationBufferMemory

def _init_memory(self):
    """Initialize conversation memory"""
    self.memory = ConversationBufferMemory(
        return_messages=True,
        memory_key="chat_history",
        output_key="answer"
    )
```

### Memory Types in LangChain:

1. **ConversationBufferMemory** (what we use):
   - Stores all conversation history
   - Simple but can get large
   - Good for short conversations

2. **ConversationBufferWindowMemory**:
   - Stores only last N messages
   - Prevents memory from growing too large
   - Good for long conversations

3. **ConversationSummaryMemory**:
   - Summarizes old conversations
   - Keeps recent messages + summary
   - Good for very long conversations

4. **ConversationTokenBufferMemory**:
   - Stores messages up to token limit
   - Automatically manages size
   - Good for cost control

### How Our Memory Works:

```python
def chat(self, question: str) -> str:
    """Main chat function with memory"""
    # Get conversation history
    chat_history = self.memory.chat_memory.messages
    
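    # Create prompt with history (document retrieval is omitted here to keep the focus on memory)
    prompt = self._create_prompt_with_history(question, chat_history)
    
    # Generate response (invoke returns a message object; .content is its text)
    response = self.llm.invoke(prompt)
    answer = response.content
    
    # Save to memory (the output key must match output_key="answer")
    self.memory.save_context(
        {"input": question},
        {"answer": answer}
    )
    
    return answer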
```

### Memory Storage:

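Our memory stores:
- **User Messages**: What the user asked
- **Bot Responses**: What the bot answered

`ConversationBufferMemory` does not record the following on its own; the application has to attach them, for example in each message's `additional_kwargs` (which is where `get_stats` looks for a `timestamp`):
- **Timestamps**: When each message was sent
- **Context**: Retrieved documents for each question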

### Memory in Prompts:

```python
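from typing import List

from langchain_core.messages import AIMessage, BaseMessage, HumanMessage

def _create_prompt_with_history(self, question: str, history: List[BaseMessage], context: str = "") -> str:
    """Create prompt including conversation history and retrieved context"""
    
    # Format history
    history_text = ""
    for message in history[-6:]:  # Last 6 messages
        if isinstance(message, HumanMessage):
            history_text += f"User: {message.content}\n"
        elif isinstance(message, AIMessage):
            history_text += f"Assistant: {message.content}\n"
    
    # Create full prompt. The Greek text reads: "You are a specialized assistant
    # for Greek football and the Olympiakos-Panathinaikos derby. Previous
    # conversation: ... Use the information below to answer the user's question.
    # Answer in Greek in a friendly and informative way. Content / Question / Answer."
    prompt = f"""
    Είστε ένας εξειδικευμένος βοηθός για το ελληνικό ποδόσφαιρο και το ντέρμπι Ολυμπιακός-Παναθηναϊκός.
    
    Προηγούμενη συνομιλία:
    {history_text}
    
    Χρησιμοποιήστε τις παρακάτω πληροφορίες για να απαντήσετε στην ερώτηση του χρήστη.
    Απαντήστε στα ελληνικά με φιλικό και ενημερωτικό τρόπο.
    
    Περιεχόμενο: {context}
    Ερώτηση: {question}
    Απάντηση:"""
    
    return prompt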
```

### Memory Management Functions:

```python
def get_conversation_history(self) -> List[Dict[str, str]]:
    """Get formatted conversation history"""
    messages = self.memory.chat_memory.messages
    history = []
    
    for i in range(0, len(messages), 2):
        if i + 1 < len(messages):
            history.append({
                "user": messages[i].content,
                "bot": messages[i + 1].content
            })
    
    return history

def clear_memory(self):
    """Clear conversation memory"""
    self.memory.clear()

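def get_stats(self) -> str:
    """Get conversation statistics (labels are in Greek for the end user)"""
    messages = self.memory.chat_memory.messages
    if not messages:  # Avoid an IndexError on messages[0] when nothing was said yet
        return "Δεν υπάρχει συνομιλία ακόμα."  # "No conversation yet."
    user_messages = [m for m in messages if isinstance(m, HumanMessage)]
    bot_messages = [m for m in messages if isinstance(m, AIMessage)]
    
    # Labels: Conversation / Questions / Answers / Started / Last activity
    return f"""
    Συνομιλία:
    Ερωτήσεις: {len(user_messages)}
    Απαντήσεις: {len(bot_messages)}
    Ξεκίνησε: {messages[0].additional_kwargs.get('timestamp', 'Unknown')}
    Τελευταία δραστηριότητα: {messages[-1].additional_kwargs.get('timestamp', 'Unknown')}
    """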
```

### Memory Benefits:

1. **Context Awareness**: Bot remembers what was discussed
2. **Follow-up Questions**: Can answer "Who is that?" after discussing a player
3. **Personalization**: Can adapt responses based on user's interests
4. **Continuity**: Conversations feel natural and flowing
5. **User Experience**: More engaging and human-like interactions

### Memory Limitations:

1. **Size Growth**: Memory can get very large over time
2. **Cost**: More memory = more tokens = higher cost
3. **Relevance**: Old information might not be relevant
4. **Privacy**: Sensitive information stored in memory

### Best Practices:

1. **Limit History**: Only keep recent relevant messages
2. **Summarize Old**: Compress old conversations
3. **Clear Option**: Allow users to clear memory
4. **Context Filtering**: Only include relevant history
5. **Token Management**: Monitor and control token usage
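
As one way to apply the "Limit History" and "Token Management" practices, the sketch below keeps the newest messages that fit a token budget. The 4-characters-per-token estimate is a crude assumption (Greek text usually needs more tokens per character); a real system would count with the model's tokenizer. `approx_tokens` and `trim_history` are hypothetical helper names.

```python
def approx_tokens(text: str) -> int:
    """Very rough token estimate: about 4 characters per token (assumption)."""
    return max(1, len(text) // 4)

def trim_history(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages whose combined estimate fits the budget."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = approx_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order

history = ["a" * 40, "b" * 40, "c" * 40]  # roughly 10 tokens each
print(len(trim_history(history, max_tokens=25)))  # only the newest messages survive
```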
