# Capstone Project - Build Your Own Generative AI Assistant with RAG, LLMs, and Multimodal Input

Estimated Time: 120 minutes

## 📋 Overview

This capstone project guides you through building a generative AI assistant that combines retrieval-augmented generation (RAG), large language models (LLMs), and multimodal inputs. The assistant retrieves relevant context, generates grounded responses, returns structured output, and optionally processes image/audio inputs. The project consolidates your learning and gives you a portfolio piece that showcases your generative AI skills.

## 🎯 Learning Outcomes
By the end of this project, you will be able to:
* Implement a retrieval-augmented generation system using vector embeddings and FAISS.
* Integrate large language models for generating contextually accurate responses.
* Design structured outputs for consistent data formatting.
* (Optional) Process and integrate image/audio inputs into your AI assistant.

## Task 1: Building the Retrieval Layer with FAISS [30 minutes]
Create the retrieval component of your AI assistant, which will fetch relevant documents based on user queries.

**Steps**:

1.  Load a pre-trained model for embeddings
2.  Sample corpus
3.  Generate embeddings
4.  Initialize FAISS index
5.  Query
6.  Search for top 2 most similar documents
7.  Print retrieved documents

In [None]:
# Task 1: Building the Retrieval Layer with FAISS
from sentence_transformers import SentenceTransformer
import faiss

# Load a pre-trained model for embeddings


# Sample corpus


# Generate embeddings


# Initialize FAISS index


# Query


# Search for top 2 most similar documents


# Print retrieved documents

## Grading of Lab Assignments:
The grading of this assignment is based on the test cases throughout this notebook within the `### BEGIN TESTS` and `### END TESTS` comments.

Each task has a number of test cells. For example, the cell below uses tests to confirm the proper implementation of the retrieval layer from Task 1; later test cells cover the LLM integration and structured outputs.

Run all of these test cells throughout the project to confirm you pass the tests and are on the right track. Once you have passed all the tests in the entire notebook, you can submit your assignment.

Good luck!

In [None]:
### BEGIN TESTS
assert 'model' in locals(), "You should define a variable 'model' for the SentenceTransformer"

assert isinstance(model, SentenceTransformer), "model should be a SentenceTransformer instance"

assert 'corpus' in locals(), "You should define a variable 'corpus' containing the sample documents"
assert isinstance(corpus, list), "corpus should be a list of strings"
assert len(corpus) >= 3, "corpus should contain at least 3 documents"

assert 'corpus_embeddings' in locals(), "You should define a variable 'corpus_embeddings'"
assert hasattr(corpus_embeddings, 'shape'), "corpus_embeddings should have a shape attribute (numpy array)"

assert 'index' in locals(), "You should define a variable 'index' for the FAISS index"
assert hasattr(index, 'search'), "index should have a search method"

assert 'query' in locals(), "You should define a variable 'query' containing the search query"
assert isinstance(query, str), "query should be a string"

assert 'retrieved_indices' in locals(), "You should define a variable 'retrieved_indices'"
assert hasattr(retrieved_indices, '__len__'), "retrieved_indices should be iterable"
### END TESTS

### 🔍 Think about:
* How embeddings represent semantic meaning.
* The role of FAISS in efficient retrieval.
* How retrieval improves response accuracy.

### ✅ Success Checklist

* Retrieved relevant documents based on the query.
* Understood the embedding generation process.
* Used FAISS efficiently for document retrieval.

### 💡 Key Points

* Embeddings capture semantic information.
* FAISS enables fast vector search.
* Retrieval enhances LLM response grounding.
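
To make the idea of vector search concrete, here is a minimal pure-Python sketch of the exact (brute-force) nearest-neighbour search that a flat L2 index performs conceptually. The three-dimensional vectors are hypothetical stand-ins, not real model output, and the helper names `squared_l2` and `brute_force_search` are illustrative only; they are not part of FAISS.

```python
# Toy 3-dimensional "embeddings" (hypothetical values, not produced by a model).
toy_corpus = [
    "AI is transforming industries.",
    "Generative models create new content.",
    "LLMs can understand and generate text.",
]
toy_embeddings = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.1, 0.3, 0.9],
]
toy_query_embedding = [0.0, 0.2, 1.0]


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(query_vec, vectors, k):
    """Return (distances, indices) of the k vectors closest to query_vec."""
    scored = sorted((squared_l2(query_vec, v), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


distances, indices = brute_force_search(toy_query_embedding, toy_embeddings, k=2)
print([toy_corpus[i] for i in indices])
```

The closest vectors come first, so the document whose toy embedding lies nearest the query is returned at position 0. A real index does the same comparison over high-dimensional model embeddings, with optimized code in place of the Python loop.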

## Task 2: Integrating LLM for Response Generation [30 minutes]

Connect your retrieval results with an LLM to generate informed responses.

**Steps**:
1.  Load GPT-2 model and tokenizer
2.  Prepare prompt with retrieved context (use context from Task 1)
3.  Tokenize and generate response
4.  Print the generated response

In [None]:
# Task 2: Integrating LLM for Response Generation
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load GPT-2 model and tokenizer


# Prepare prompt with retrieved context (use context from Task 1)


# Tokenize and generate response


# Print the generated response

In [None]:
### BEGIN TESTS
assert 'tokenizer' in locals(), "You should define a variable 'tokenizer' for the GPT2 tokenizer"
assert isinstance(tokenizer, GPT2Tokenizer), "tokenizer should be a GPT2Tokenizer instance"

assert 'llm_model' in locals() or 'gpt2_model' in locals(), "You should define a variable for the GPT2 model (avoid naming conflict with SentenceTransformer)"

assert 'prompt' in locals(), "You should define a variable 'prompt' containing the formatted prompt"
assert isinstance(prompt, str), "prompt should be a string"
assert len(prompt) > 20, "prompt should be substantial and contain context"

assert 'response' in locals(), "You should define a variable 'response' containing the generated response"
assert isinstance(response, str), "response should be a string"
assert len(response) > len(prompt), "response should be longer than the input prompt"
### END TESTS

### 🔍 Think about:
* How context affects the quality of the generated response.
* The importance of prompt design.
* How LLMs leverage context for better accuracy.

### ✅ Success Checklist
1. Generated coherent and contextually grounded responses.
2. Utilized LLMs effectively with contextual prompts.
3. Improved response accuracy using retrieval.

### 💡 Key Points
* Contextual prompts enhance LLM output.
* LLMs generate text based on input context.
* Retrieval provides factual grounding.
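
One way to think about prompt design is as a small formatting function that turns retrieved documents and a question into a single string. The sketch below is an assumption about how you might structure it; `build_prompt` and `strip_prompt` are hypothetical helper names, and the instruction wording is only one of many reasonable choices.

```python
def build_prompt(context_docs, question):
    """Format retrieved documents and a question into a single prompt string."""
    context_block = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context_block}\n"
        f"Question: {question}\n"
        "Answer:"
    )


def strip_prompt(generated_text, prompt_text):
    """Return only the continuation when the generated text echoes the prompt."""
    if generated_text.startswith(prompt_text):
        return generated_text[len(prompt_text):].strip()
    return generated_text.strip()


example_prompt = build_prompt(
    ["LLMs can understand and generate text.", "Generative models create new content."],
    "How do LLMs work?",
)
print(example_prompt)
```

Causal models such as GPT-2 return the prompt followed by the continuation, so a helper like `strip_prompt` is useful for display. The test cell in this task expects `response` to be longer than `prompt`, so keep the full decoded text in `response` and store any stripped version in a separate variable.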

## Task 3: Designing Structured Outputs [30 minutes]

Implement structured output generation for consistent data formatting.

**Steps**:
1.  Create a structured response dictionary with the following keys:
    *  "question": the query from Task 1
    *  "context": one of the retrieved documents from Task 1
    *  "response": a sample response about how LLMs work
2.  Convert to JSON with proper formatting
3.  Print the structured output

In [None]:
# Task 3: Designing Structured Outputs
import json

# Create a structured response dictionary with the following keys:
# - "question": the query from Task 1
# - "context": one of the retrieved documents from Task 1  
# - "response": a sample response about how LLMs work


# Convert to JSON with proper formatting


# Print the structured output

In [None]:
### BEGIN TESTS
assert 'response_data' in locals(), "You should define a variable 'response_data' containing the structured data"
assert isinstance(response_data, dict), "response_data should be a dictionary"

required_keys = ["question", "context", "response"]

for key in required_keys:
    assert key in response_data, "response_data should contain the key: " + str(key)
    assert isinstance(response_data[key], str), "response_data" + "[" + str(key) + "]" + " should be a string"
    assert len(response_data[key]) > 0, "response_data" + "[" + str(key) + "]" + " should not be empty"

assert 'structured_output' in locals(), "You should define a variable 'structured_output' containing the JSON string"
assert isinstance(structured_output, str), "structured_output should be a JSON string"

# Test that it's valid JSON
try:
    json.loads(structured_output)
except json.JSONDecodeError:
    assert False, "structured_output should be valid JSON"
### END TESTS


### 🔍 Think about:
* The benefits of structured data.
* How to ensure output format consistency.
* The role of JSON in data interchange.

### ✅ Success Checklist
* Created structured JSON outputs.
* Ensured consistent response formatting.
* Applied structured output techniques effectively.

### 💡 Key Points
* Structured outputs ensure data consistency.
* JSON is a common format for structured data.
* Consistency supports integration and analysis.
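
Format consistency is easiest to enforce on the consuming side, by parsing the JSON and checking its fields before using it. The following is a minimal sketch using only the standard library; `parse_structured_output` and `REQUIRED_KEYS` are hypothetical names chosen to mirror the keys this task asks for.

```python
import json

REQUIRED_KEYS = ("question", "context", "response")


def parse_structured_output(raw_json):
    """Parse a JSON string and check it has the expected non-empty string fields."""
    data = json.loads(raw_json)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key in REQUIRED_KEYS:
        value = data.get(key)
        if not isinstance(value, str) or not value:
            raise ValueError(f"missing or empty field: {key}")
    return data


example_json = json.dumps(
    {
        "question": "How do LLMs work?",
        "context": "LLMs can understand and generate text.",
        "response": "LLMs generate text one token at a time, conditioned on the prompt.",
    },
    indent=4,
)
parsed = parse_structured_output(example_json)
print(parsed["question"])
```

Failing loudly on a missing or empty field keeps malformed output from propagating into downstream code that assumes the schema.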

## Task 4: Processing Multimodal Inputs (Optional) [30 minutes]

Extend your AI assistant to process image inputs, a first step toward a multimodal experience. The steps below use CLIP to embed an image; audio inputs are not covered in this task.

**Steps**:

1.  Load CLIP model and processor

2.  Load and preprocess image

3.  Generate image embeddings

In [None]:
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
import requests
import numpy as np
from io import BytesIO

# Load CLIP model and processor

# Load and preprocess image

# Generate image embeddings

### 🔍 Think about:
* How images are encoded into embeddings.
* The role of vision-language models.
* How multimodal inputs enhance assistant capabilities.

### ✅ Success Checklist
1. Successfully processed image inputs.
2. Generated embeddings from visual data.
3. Integrated multimodal inputs into the assistant.

### 💡 Key Points
* Multimodal inputs expand assistant functionality.
* Vision-language models process visual data.
* Image embeddings capture semantic features.
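
Once an image has been embedded, a common next step is to compare it with text embeddings from the same vision-language model. The sketch below shows that comparison with cosine similarity in pure Python. The vectors are hypothetical toy values; in practice they would come from calls such as `get_image_features` and `get_text_features` on a CLIP model, and the helper names here are illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    a_unit, b_unit = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_unit, b_unit))


# Hypothetical stand-ins for an image embedding and two caption embeddings.
toy_image_embedding = [0.8, 0.1, 0.6]
toy_caption_embeddings = {
    "a photo of a dog": [0.9, 0.0, 0.5],
    "a diagram of a neural network": [0.1, 0.9, 0.2],
}

# Pick the caption whose embedding points in the most similar direction.
best_caption = max(
    toy_caption_embeddings,
    key=lambda caption: cosine_similarity(toy_image_embedding, toy_caption_embeddings[caption]),
)
print(best_caption)
```

Normalizing first means vector length does not affect the ranking, only direction does, which is why cosine similarity is a common choice for comparing embeddings across modalities.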

## 🚀 Next Steps

Congratulations on completing the capstone project! You've built a generative AI assistant that combines FAISS-based retrieval, GPT-2 text generation, structured JSON output, and (optionally) CLIP image embeddings. Next, explore scaling and deployment strategies to make your assistant production-ready.


## **📋 Exemplar Solution**
<details>
<summary><strong>
Click HERE to see an exemplar solution
</strong></summary>

### **Task 1 Solution**
```python
from sentence_transformers import SentenceTransformer
import faiss

# Load a pre-trained model for embeddings
model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")

# Sample corpus
corpus = [
    "AI is transforming industries.", 
    "Generative models create new content.", 
    "LLMs can understand and generate text."
]

# Generate embeddings
corpus_embeddings = model.encode(corpus)

# Initialize FAISS index
index = faiss.IndexFlatL2(corpus_embeddings.shape[1])
index.add(corpus_embeddings)

# Query
query = "How do LLMs work?"
query_embedding = model.encode([query])

# Search
_, retrieved_indices = index.search(query_embedding, k=2)
print("Retrieved documents:", [corpus[idx] for idx in retrieved_indices[0]])
```
    
### **Task 2 Solution**
```python
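from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load GPT-2 model and tokenizer
# (named llm_model so it does not overwrite the SentenceTransformer `model` from Task 1,
# which the Task 1 and Task 2 test cells both rely on)
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
llm_model = GPT2LMHeadModel.from_pretrained(model_name)

# Prepare prompt with retrieved context (top document and query from Task 1)
context = corpus[retrieved_indices[0][0]]
prompt = f"Context: {context}\nQuestion: {query}"

# Tokenize and generate response
inputs = tokenizer(prompt, return_tensors="pt")
outputs = llm_model.generate(
    **inputs,
    max_new_tokens=50,  # limits generated tokens only, independent of prompt length
    num_return_sequences=1,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Response:", response)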
```

### **Task 3 Solution**
```python
import json

# Example structured response
response_data = {
    "question": "How do LLMs work?",
    "context": "AI is transforming industries.",
    "response": "LLMs work by understanding and generating text based on context."
}

# Convert to JSON
structured_output = json.dumps(response_data, indent=4)
print("Structured Output:", structured_output)
```
    
### **Task 4 Solution**
```python
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

# Load CLIP model and processor
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Load and preprocess image
image = Image.open("example.jpg")
inputs = clip_processor(images=image, return_tensors="pt")

# Generate image embeddings
image_features = clip_model.get_image_features(**inputs)
print("Image Features:", image_features)
```

</details>