# Conversational Language Models: RAG, Fine-tuning, and MCP Server

## Part A: Retrieval Augmented Generation (RAG) Pipeline [10 points]

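This section implements a RAG pipeline around a pre-trained conversational language model from Hugging Face (`microsoft/DialoGPT-medium`). For each question, the pipeline retrieves the most relevant chunks from an external knowledge base (a FAISS index over sentence-transformer embeddings) and places them in the prompt so the model can ground its answer in them.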
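
The retrieve-then-prompt idea can be illustrated without any model at all. The sketch below is a toy stand-in, not the pipeline used in this notebook: it assumes bag-of-words vectors in place of learned embeddings, and the names `bag_of_words`, `cosine`, `retrieve`, `build_prompt`, and the `toy_*` variables are made up for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for a learned sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the question."""
    q = bag_of_words(question)
    ranked = sorted(passages, key=lambda p: cosine(q, bag_of_words(p)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Place the retrieved passages ahead of the question."""
    return "Context: " + " ".join(context) + f"\n\nQuestion: {question}\n\nAnswer:"


toy_passages = [
    "Vector databases store embeddings of text documents to enable semantic search.",
    "LoRA is a parameter-efficient fine-tuning technique.",
    "Retrieval Augmented Generation combines information retrieval with text generation.",
]
toy_question = "What is Retrieval Augmented Generation?"
top = retrieve(toy_question, toy_passages, k=1)
print(build_prompt(toy_question, top))
```

A real pipeline swaps `bag_of_words` for an embedding model and the sorted scan for a vector index, but the flow (embed the question, rank passages, prepend the winners to the prompt) stays the same.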


In [1]:
%pip install transformers langchain_community sentence-transformers datasets peft faiss-cpu "accelerate>=0.26.0" google-generativeai

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.llms import HuggingFacePipeline
from langchain_community.document_loaders import TextLoader
import warnings
warnings.filterwarnings('ignore')

if torch.cuda.is_available():
    device = "cuda"
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
print(f"Using device: {device}")


Using device: mps


In [3]:
model_name = "microsoft/DialoGPT-medium"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model = model.to(device)
tokenizer.pad_token = tokenizer.eos_token

print(f"Loaded model: {model_name}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Model moved to device: {device}")


Loaded model: microsoft/DialoGPT-medium
Model parameters: 354,823,168
Model moved to device: mps


In [4]:
knowledge_base_text = """
Machine Learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. 
Deep Learning uses neural networks with multiple layers to model and understand complex patterns in data.
Natural Language Processing (NLP) is a field of AI that focuses on the interaction between computers and human language.
Transformers are a type of neural network architecture that has revolutionized NLP, introduced in the paper "Attention is All You Need".
BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based model that reads text bidirectionally.
GPT (Generative Pre-trained Transformer) is a family of autoregressive language models that generate text sequentially.
Retrieval Augmented Generation (RAG) combines information retrieval with text generation to provide more accurate and contextual responses.
Vector databases store embeddings of text documents to enable semantic search and retrieval.
Fine-tuning adapts pre-trained models to specific tasks or domains by training on task-specific data.
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that reduces the number of trainable parameters.
"""

with open("knowledge_base.txt", "w") as f:
    f.write(knowledge_base_text)

print("Knowledge base created with AI/ML domain information")


Knowledge base created with AI/ML domain information


In [5]:
loader = TextLoader("knowledge_base.txt")
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=50)
texts = text_splitter.split_documents(documents)

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

vectorstore = FAISS.from_documents(documents=texts, embedding=embeddings)
vectorstore.save_local("./faiss_index")
print(f"Created vector store with {len(texts)} document chunks")


Created vector store with 9 document chunks
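
To make `chunk_size` and `chunk_overlap` concrete, here is a rough sketch of a plain sliding window over characters. This is a simplification under stated assumptions: `RecursiveCharacterTextSplitter` first tries to split on separators such as newlines and spaces, so the chunks it produces will generally differ. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows; each window repeats the last
    chunk_overlap characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in demo_chunks:
    print(chunk)
```

The overlap means a sentence cut at a chunk boundary still appears partly intact in the next chunk, at the cost of storing some text twice.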


In [6]:
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=150,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

llm = HuggingFacePipeline(pipeline=generator)

retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

def rag_query(question):
    docs = retriever.invoke(question)
    context = " ".join([doc.page_content for doc in docs])
    
    prompt_text = f"""Use the following context to answer the question. 
If you don't know the answer, just say that you don't know.

Context: {context}

Question: {question}

Answer:"""
    
    result = llm.invoke(prompt_text)
    return {"answer": result, "context": docs}

print("RAG pipeline initialized")


Device set to use mps:0


RAG pipeline initialized


In [7]:
query = "What is Retrieval Augmented Generation?"
result = rag_query(query)

print("Query:", query)
print("\nAnswer:", result["answer"])
print("\nRetrieved Documents:")
for i, doc in enumerate(result["context"][:2], 1):
    print(f"\nDocument {i}: {doc.page_content[:100]}...")


Query: What is Retrieval Augmented Generation?

Answer: Use the following context to answer the question. 
If you don't know the answer, just say that you don't know.

Context: Retrieval Augmented Generation (RAG) combines information retrieval with text generation to provide more accurate and contextual responses. GPT (Generative Pre-trained Transformer) is a family of autoregressive language models that generate text sequentially. Vector databases store embeddings of text documents to enable semantic search and retrieval.
Fine-tuning adapts pre-trained models to specific tasks or domains by training on task-specific data.

Question: What is Retrieval Augmented Generation?

Answer:

Retrieved Documents:

Document 1: Retrieval Augmented Generation (RAG) combines information retrieval with text generation to provide ...

Document 2: GPT (Generative Pre-trained Transformer) is a family of autoregressive language models that generate...
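
Note that the answer printed above is only the prompt echoed back, with nothing after `Answer:`. Two plausible contributing factors, neither verified in this notebook: `max_length=150` caps the prompt and the completion together, and DialoGPT is a dialogue model that was not trained to follow instruction-style prompts. The helper below is a hypothetical post-processing sketch (the name `extract_answer` and the fallback message are assumptions) that strips an echoed prompt and makes an empty generation explicit.

```python
def extract_answer(generated: str, prompt: str) -> str:
    """Return only the text produced after the prompt, or a fallback
    message when the model added nothing."""
    # Some pipelines return prompt + completion; drop the echoed prompt if present.
    completion = generated[len(prompt):] if generated.startswith(prompt) else generated
    completion = completion.strip()
    return completion if completion else "(model produced no new text)"


demo_prompt = "Context: RAG combines retrieval with generation.\n\nQuestion: What is RAG?\n\nAnswer:"
print(extract_answer(demo_prompt, demo_prompt))
print(extract_answer(demo_prompt + " It pairs a retriever with a generator.", demo_prompt))
```

On the generation side, passing `max_new_tokens` instead of `max_length` to the `transformers` pipeline limits only the newly generated tokens, so a long retrieved context does not use up the budget.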


## Part B: Fine-tuning with Domain-Specific Dataset

This section fine-tunes a pre-trained causal language model (`gpt2`) on a small domain-specific question-and-answer dataset. We use LoRA (Low-Rank Adaptation), which keeps the base weights frozen and trains only small low-rank adapter matrices, for parameter-efficient fine-tuning.


In [8]:
from datasets import Dataset
from peft import LoraConfig, get_peft_model, TaskType
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

fine_tune_model_name = "gpt2"
fine_tune_tokenizer = AutoTokenizer.from_pretrained(fine_tune_model_name)
fine_tune_model = AutoModelForCausalLM.from_pretrained(fine_tune_model_name)
fine_tune_model = fine_tune_model.to(device)

fine_tune_tokenizer.pad_token = fine_tune_tokenizer.eos_token

print(f"Loaded model for fine-tuning: {fine_tune_model_name}")
print(f"Model moved to device: {device}")


Loaded model for fine-tuning: gpt2
Model moved to device: mps


In [9]:
domain_data = [
    "Question: What is machine learning? Answer: Machine learning is a method of data analysis that automates analytical model building.",
    "Question: What is deep learning? Answer: Deep learning is a subset of machine learning that uses neural networks with multiple layers.",
    "Question: What is NLP? Answer: Natural Language Processing is a field of AI that focuses on understanding and generating human language.",
    "Question: What are transformers? Answer: Transformers are neural network architectures that use attention mechanisms to process sequences.",
    "Question: What is BERT? Answer: BERT is a bidirectional transformer model that reads text in both directions simultaneously.",
    "Question: What is GPT? Answer: GPT is a generative pre-trained transformer model that generates text autoregressively.",
    "Question: What is fine-tuning? Answer: Fine-tuning is the process of adapting a pre-trained model to a specific task or domain.",
    "Question: What is LoRA? Answer: LoRA is a parameter-efficient fine-tuning technique that uses low-rank matrices.",
    "Question: What is RAG? Answer: Retrieval Augmented Generation combines information retrieval with text generation.",
    "Question: What is a vector database? Answer: A vector database stores embeddings of documents for semantic search."
]

def preprocess_function(examples):
    return fine_tune_tokenizer(
        examples["text"],
        truncation=True,
        max_length=128,
        padding="max_length"
    )

dataset = Dataset.from_dict({"text": domain_data})
tokenized_dataset = dataset.map(preprocess_function, batched=True)

print(f"Created fine-tuning dataset with {len(dataset)} examples")


Map: 100%|██████████| 10/10 [00:00<00:00, 2475.83 examples/s]

Created fine-tuning dataset with 10 examples





In [10]:
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["c_attn", "c_proj"]
)

peft_model = get_peft_model(fine_tune_model, lora_config)
peft_model = peft_model.to(device)
peft_model.print_trainable_parameters()

print("LoRA configuration applied to model")
print(f"PEFT model on device: {device}")


trainable params: 811,008 || all params: 125,250,816 || trainable%: 0.6475
LoRA configuration applied to model
PEFT model on device: mps
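
The trainable-parameter count reported above can be reproduced by hand. A LoRA adapter of rank `r` on a layer mapping `in_features` to `out_features` adds two matrices with `r * (in_features + out_features)` parameters in total. The sketch below assumes the standard GPT-2 small dimensions (12 blocks, hidden size 768, MLP size 3072) and assumes that the target name `c_proj` matches both the attention output projection and the MLP output projection in every block; the matching total supports that reading. The helper `lora_params` is illustrative.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


# GPT-2 small dimensions (assumed here): 12 blocks, hidden size 768, MLP size 3072.
n_layers, d_model, d_mlp, rank = 12, 768, 3072, 8

per_block = (
    lora_params(d_model, 3 * d_model, rank)   # attn.c_attn: 768 -> 2304 (fused Q, K, V)
    + lora_params(d_model, d_model, rank)     # attn.c_proj: 768 -> 768
    + lora_params(d_mlp, d_model, rank)       # mlp.c_proj:  3072 -> 768
)
trainable = n_layers * per_block
base_params = 125_250_816 - 811_008  # frozen GPT-2 weights, taken from the printed totals
print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / (base_params + trainable):.4f}")
```

Each block contributes 67,584 adapter parameters, and 12 blocks give 811,008, which is about 0.65% of the full model.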


In [11]:
training_args = TrainingArguments(
    output_dir="./fine_tuned_model",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
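    warmup_steps=10,  # longer than this run's handful of optimizer steps, so the learning rate never finishes warming up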
    logging_steps=5,
    save_strategy="epoch"
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=fine_tune_tokenizer,
    mlm=False
)

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator
)

print("Starting fine-tuning with LoRA...")
trainer.train()
print("Fine-tuning completed")


Starting fine-tuning with LoRA...


`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
5,3.8343


Fine-tuning completed


In [12]:
test_prompt = "Question: What is machine learning? Answer:"
inputs = fine_tune_tokenizer(test_prompt, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = peft_model.generate(**inputs, max_length=50, num_return_sequences=1)

generated_text = fine_tune_tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Test Prompt:", test_prompt)
print("Generated Response:", generated_text)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Test Prompt: Question: What is machine learning? Answer:
Generated Response: Question: What is machine learning? Answer: Machine learning is the process of learning.

What is machine learning?

Machine learning is a process of learning.

Machine learning is a process of.

Machine learning is a process


## Part C: Model Context Protocol (MCP) Server

This section implements a simplified MCP-style server that exposes tools to an AI client over HTTP POST. It handles two methods, `tools/list` and `tools/call`, and provides two tools: a capital city lookup and a population lookup, both backed by in-memory tables for a fixed set of countries.
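
For context, the MCP specification frames its messages as JSON-RPC 2.0, whereas the server in this section accepts and returns bare JSON objects. The sketch below shows what the JSON-RPC envelope around a method such as `tools/list` could look like. It is not a complete MCP implementation (for example, it omits the initialization handshake), and `handle_jsonrpc`, `Handler`, and `demo_handlers` are hypothetical names.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical registry type for this sketch: method name -> handler taking params.
Handler = Callable[[Dict[str, Any]], Dict[str, Any]]


def handle_jsonrpc(raw: str, handlers: Dict[str, Handler]) -> str:
    """Dispatch one JSON-RPC 2.0 request and return the serialized response."""
    try:
        request = json.loads(raw)
    except json.JSONDecodeError:
        return json.dumps({"jsonrpc": "2.0", "id": None,
                           "error": {"code": -32700, "message": "Parse error"}})
    if not isinstance(request, dict):
        return json.dumps({"jsonrpc": "2.0", "id": None,
                           "error": {"code": -32600, "message": "Invalid Request"}})
    request_id = request.get("id")
    handler = handlers.get(request.get("method"))
    if handler is None:
        return json.dumps({"jsonrpc": "2.0", "id": request_id,
                           "error": {"code": -32601, "message": "Method not found"}})
    # Successful calls echo the request id and carry the payload under "result".
    result = handler(request.get("params", {}))
    return json.dumps({"jsonrpc": "2.0", "id": request_id, "result": result})


demo_handlers = {"tools/list": lambda params: {"tools": [{"name": "get_capital"}]}}
reply = json.loads(handle_jsonrpc(
    '{"jsonrpc": "2.0", "id": 1, "method": "tools/list"}', demo_handlers))
print(reply)
```

The `id` field lets a client match responses to requests, and the separate `error` member keeps failures distinct from successful results.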


In [13]:
import json
import socket
from typing import Dict, Any
from http.server import HTTPServer, BaseHTTPRequestHandler

class MCPServer(BaseHTTPRequestHandler):
    def do_POST(self):
        content_length = int(self.headers['Content-Length'])
        post_data = self.rfile.read(content_length)
        request_data = json.loads(post_data.decode('utf-8'))
        
        if request_data.get('method') == 'tools/list':
            response = {
                "tools": [
                    {
                        "name": "get_capital",
                        "description": "Get the capital city of a country",
                        "inputSchema": {
                            "type": "object",
                            "properties": {
                                "country": {
                                    "type": "string",
                                    "description": "Name of the country"
                                }
                            },
                            "required": ["country"]
                        }
                    },
                    {
                        "name": "get_population",
                        "description": "Get the population of a country",
                        "inputSchema": {
                            "type": "object",
                            "properties": {
                                "country": {
                                    "type": "string",
                                    "description": "Name of the country"
                                }
                            },
                            "required": ["country"]
                        }
                    }
                ]
            }
        elif request_data.get('method') == 'tools/call':
            tool_name = request_data.get('params', {}).get('name')
            arguments = request_data.get('params', {}).get('arguments', {})
            
            if tool_name == "get_capital":
                result = self.get_capital(arguments.get('country', 'United States'))
            elif tool_name == "get_population":
                result = self.get_population(arguments.get('country', 'United States'))
            else:
                result = {"error": "Unknown tool"}
            
            response = {"content": [{"type": "text", "text": json.dumps(result)}]}
        else:
            response = {"error": "Unknown method"}
        
        self.send_response(200)
        self.send_header('Content-type', 'application/json')
        self.end_headers()
        self.wfile.write(json.dumps(response).encode('utf-8'))
    
    def get_capital(self, country: str) -> Dict[str, Any]:
        capitals = {
            "United States": "Washington, D.C.",
            "USA": "Washington, D.C.",
            "Ukraine": "Kyiv",
            "France": "Paris",
            "Germany": "Berlin",
            "United Kingdom": "London",
            "UK": "London",
            "Japan": "Tokyo",
            "China": "Beijing",
            "India": "New Delhi",
            "Canada": "Ottawa",
            "Australia": "Canberra",
            "Brazil": "Brasília",
            "Russia": "Moscow",
            "Italy": "Rome",
            "Spain": "Madrid",
            "Mexico": "Mexico City",
            "South Korea": "Seoul",
            "Argentina": "Buenos Aires",
            "South Africa": "Cape Town"
        }
        
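        # Compare case-insensitively: str.title() turns "USA" into "Usa" and "UK" into "Uk",
        # so those aliases could never match.
        lookup = {name.lower(): capital for name, capital in capitals.items()}
        country_key = country.strip().lower()
        if country_key in lookup:
            return {
                "country": country,
                "capital": lookup[country_key]
            }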
        else:
            return {
                "country": country,
                "capital": "Unknown",
                "note": f"Capital not found in database for {country}"
            }
    
    def get_population(self, country: str) -> Dict[str, Any]:
        populations = {
            "United States": 331900000,
            "USA": 331900000,
            "Ukraine": 43790000,
            "France": 67800000,
            "Germany": 83200000,
            "United Kingdom": 67000000,
            "UK": 67000000,
            "Japan": 125800000,
            "China": 1402000000,
            "India": 1380000000,
            "Canada": 38200000,
            "Australia": 25700000,
            "Brazil": 215300000,
            "Russia": 146200000,
            "Italy": 59000000,
            "Spain": 47400000,
            "Mexico": 128900000,
            "South Korea": 51700000,
            "Argentina": 45800000,
            "South Africa": 60000000
        }
        
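        # Compare case-insensitively; str.title() would break the "USA" and "UK" aliases.
        lookup = {name.lower(): count for name, count in populations.items()}
        country_key = country.strip().lower()
        if country_key in lookup:
            pop = lookup[country_key]
            return {
                "country": country,
                "population": pop,
                "population_formatted": f"{pop:,}"
            }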
        else:
            return {
                "country": country,
                "population": "Unknown",
                "note": f"Population data not found in database for {country}"
            }
    
    def log_message(self, format, *args):
        pass

def find_free_port(start_port=8000, max_attempts=10):
    for port in range(start_port, start_port + max_attempts):
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                s.bind(('localhost', port))
                return port
        except OSError:
            continue
    raise RuntimeError(f"Could not find free port in range {start_port}-{start_port + max_attempts - 1}")

print("MCP Server class and helper functions defined")


MCP Server class and helper functions defined


In [14]:
import threading

port = find_free_port(8000)
server = HTTPServer(('localhost', port), MCPServer)

# Run the server in a daemon thread so the notebook stays responsive.
server_thread = threading.Thread(target=server.serve_forever, daemon=True)
server_thread.start()

print(f"MCP Server started on http://localhost:{port}")
print("Server is ready to accept requests")


MCP Server started on http://localhost:8000
Server is ready to accept requests


### MCP Server Testing

The server can be tested using direct API calls or integrated with an LLM client like Google Gemini. Below are demonstrations of the server functionality.


In [15]:
import requests
import json

base_url = f"http://localhost:{port}"

print("="*70)
print("TESTING MCP SERVER")
print("="*70)

print("\n1. Listing available tools:")
response = requests.post(base_url, json={'method': 'tools/list'}, timeout=5)
tools = response.json().get('tools', [])
for tool in tools:
    print(f"   - {tool['name']}: {tool['description']}")

print("\n2. Testing get_capital tool:")
response = requests.post(base_url, json={
    'method': 'tools/call',
    'params': {
        'name': 'get_capital',
        'arguments': {'country': 'Ukraine'}
    }
}, timeout=5)
capital_result = json.loads(response.json()['content'][0]['text'])
print(f"   Country: {capital_result.get('country')}")
print(f"   Capital: {capital_result.get('capital')}")

print("\n3. Testing get_population tool:")
response = requests.post(base_url, json={
    'method': 'tools/call',
    'params': {
        'name': 'get_population',
        'arguments': {'country': 'United States'}
    }
}, timeout=5)
pop_result = json.loads(response.json()['content'][0]['text'])
print(f"   Country: {pop_result.get('country')}")
print(f"   Population: {pop_result.get('population_formatted')}")

print("\n" + "="*70)
print("MCP SERVER TESTING COMPLETE")
print("="*70)


TESTING MCP SERVER

1. Listing available tools:
   - get_capital: Get the capital city of a country
   - get_population: Get the population of a country

2. Testing get_capital tool:
   Country: Ukraine
   Capital: Kyiv

3. Testing get_population tool:
   Country: United States
   Population: 331,900,000

MCP SERVER TESTING COMPLETE


### Testing with Google Gemini

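The MCP server can be combined with Google Gemini to answer natural-language questions. In this demonstration, plain keyword matching in Python decides which tool to call and for which country; Gemini only phrases the final answer from the tool result. Because the dispatcher picks a single tool per query, a question that asks for both a capital and a population gets only the first one answered.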


In [16]:
import google.generativeai as genai
import os

# Read the key from the environment; never commit API keys to a notebook.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel('gemini-2.5-flash')

MCP_SERVER_URL = f"http://localhost:{port}"

def call_mcp_tool(tool_name, arguments):
    try:
        response = requests.post(MCP_SERVER_URL, json={
            'method': 'tools/call',
            'params': {
                'name': tool_name,
                'arguments': arguments
            }
        }, timeout=5)
        if response.status_code == 200:
            result = json.loads(response.json()['content'][0]['text'])
            return result
        else:
            return {"error": f"Server returned status {response.status_code}"}
    except Exception as e:
        return {"error": str(e)}

def chat_with_gemini_and_tools(user_query):
    tool_called = False
    tool_result = None
    query_lower = user_query.lower()
    
    if "capital" in query_lower:
        country = "United States"
        countries = {
            "ukraine": "Ukraine",
            "france": "France",
            "germany": "Germany",
            "united kingdom": "United Kingdom",
            "uk": "United Kingdom",
            "japan": "Japan",
            "china": "China",
            "india": "India",
            "canada": "Canada",
            "australia": "Australia",
            "brazil": "Brazil",
            "russia": "Russia",
            "italy": "Italy",
            "spain": "Spain",
            "mexico": "Mexico",
            "south korea": "South Korea",
            "argentina": "Argentina",
            "south africa": "South Africa"
        }
        
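        for key, val in countries.items():
            if key in query_lower:
                country = val
                break

        # Match "usa"/"us" as whole words only; the substring test `"us" in query_lower`
        # also fired on "russia" and "australia" and overrode the detected country.
        query_words = set("".join(c if c.isalnum() else " " for c in query_lower).split())
        if "united states" in query_lower or query_words & {"usa", "us"}:
            country = "United States"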
        
        print(f"\n[Calling MCP tool: get_capital with country='{country}']")
        tool_result = call_mcp_tool("get_capital", {"country": country})
        tool_called = True
        
    elif "population" in query_lower or "people" in query_lower or "inhabitants" in query_lower:
        country = "United States"
        countries = {
            "ukraine": "Ukraine",
            "france": "France",
            "germany": "Germany",
            "united kingdom": "United Kingdom",
            "uk": "United Kingdom",
            "japan": "Japan",
            "china": "China",
            "india": "India",
            "canada": "Canada",
            "australia": "Australia",
            "brazil": "Brazil",
            "russia": "Russia",
            "italy": "Italy",
            "spain": "Spain",
            "mexico": "Mexico",
            "south korea": "South Korea",
            "argentina": "Argentina",
            "south africa": "South Africa"
        }
        
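        for key, val in countries.items():
            if key in query_lower:
                country = val
                break

        # Match "usa"/"us" as whole words only; the substring test `"us" in query_lower`
        # also fired on "russia" and "australia" and overrode the detected country.
        query_words = set("".join(c if c.isalnum() else " " for c in query_lower).split())
        if "united states" in query_lower or query_words & {"usa", "us"}:
            country = "United States"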
        
        print(f"\n[Calling MCP tool: get_population with country='{country}']")
        tool_result = call_mcp_tool("get_population", {"country": country})
        tool_called = True
    
    if tool_called and tool_result:
        result_prompt = f"""Based on the tool result, provide a helpful answer to the user.

Tool Result: {json.dumps(tool_result, indent=2)}

User question: {user_query}

Provide a natural, conversational answer using the tool result:"""
        
        try:
            final_response = model.generate_content(result_prompt)
            if final_response.candidates and len(final_response.candidates) > 0:
                candidate = final_response.candidates[0]
                if candidate.content and candidate.content.parts:
                    response_text = candidate.content.parts[0].text
                else:
                    response_text = "Tool result received. Here's the data:\n" + json.dumps(tool_result, indent=2)
            else:
                response_text = "Tool result received. Here's the data:\n" + json.dumps(tool_result, indent=2)
        except Exception as e:
            response_text = f"Tool executed successfully. Result: {json.dumps(tool_result, indent=2)}\n\n(Error formatting response: {str(e)})"
        
        return f"{response_text}\n\n[Tool Result: {json.dumps(tool_result, indent=2)}]"
    
    return "No relevant tool found for this query."

test_queries = [
    "What's the capital of Ukraine?",
    "What's the population of the United States?",
    "Tell me the capital of France and the population of Japan"
]

for query in test_queries:
    print(f"\n{'='*70}")
    print(f"Query: {query}")
    print("="*70)
    try:
        result = chat_with_gemini_and_tools(query)
        print(f"Response:\n{result}")
    except Exception as e:
        print(f"Error: {e}")
    print()



Query: What's the capital of Ukraine?

[Calling MCP tool: get_capital with country='Ukraine']
Response:
The capital of Ukraine is Kyiv.

[Tool Result: {
  "country": "Ukraine",
  "capital": "Kyiv"
}]


Query: What's the population of the United States?

[Calling MCP tool: get_population with country='United States']
Response:
The population of the United States is 331,900,000.

[Tool Result: {
  "country": "United States",
  "population": 331900000,
  "population_formatted": "331,900,000"
}]


Query: Tell me the capital of France and the population of Japan

[Calling MCP tool: get_capital with country='France']
Response:
The capital of France is Paris.

I don't have information about the population of Japan in my current results.

[Tool Result: {
  "country": "France",
  "capital": "Paris"
}]
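
The last query shows the limit of the single-tool dispatcher: only `get_capital` is called, so the population of Japan goes unanswered. One possible way around this, sketched here under simplifying assumptions, is to split the query into clauses and plan one tool call per clause. The function `plan_tool_calls` and the reduced `KNOWN_COUNTRIES` table are hypothetical, and splitting on the word "and" is a heuristic that will not handle every phrasing.

```python
import re
from typing import List, Tuple

# Hypothetical subset of the country table, used only for illustration.
KNOWN_COUNTRIES = {"france": "France", "japan": "Japan", "ukraine": "Ukraine"}


def plan_tool_calls(user_query: str) -> List[Tuple[str, str]]:
    """Return (tool_name, country) pairs, one per clause that names a tool keyword."""
    calls = []
    # Treat the word "and" as a clause boundary; each clause is routed on its own.
    for clause in re.split(r"\band\b", user_query.lower()):
        country = next((name for key, name in KNOWN_COUNTRIES.items()
                        if re.search(rf"\b{re.escape(key)}\b", clause)), None)
        if country is None:
            continue  # no known country in this clause, so nothing to look up
        if "capital" in clause:
            calls.append(("get_capital", country))
        elif "population" in clause:
            calls.append(("get_population", country))
    return calls


plan = plan_tool_calls("Tell me the capital of France and the population of Japan")
print(plan)
```

Each planned pair could then be passed to `call_mcp_tool`, with all results collected into one prompt for the final answer. Letting the model choose tools itself through function calling is the more general route, but it is outside the scope of this sketch.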



## Resources

1. **Hugging Face Transformers:** https://huggingface.co/docs/transformers/
2. **LangChain Documentation:** https://python.langchain.com/
3. **PEFT (Parameter-Efficient Fine-Tuning):** https://huggingface.co/docs/peft/
4. **Model Context Protocol:** https://modelcontextprotocol.io/
5. **FAISS Vector Database:** https://github.com/facebookresearch/faiss
6. **Google Gemini API:** https://ai.google.dev/
