# Building, Training, and Testing a Local Model with Fine-Tuning and RAG in AWS Local Mode

This notebook walks through building, training, and testing a language model using Parameter-Efficient Fine-Tuning (PEFT) with LoRA and Retrieval-Augmented Generation (RAG), with the training job run in Amazon SageMaker local mode.

## 1. Set Up AWS Local Environment

In [None]:
# Install AWS CLI
!pip install awscli

# Install the SageMaker SDK with the extras needed for local mode
!pip install "sagemaker[local]"

# Install additional dependencies
!pip install torch transformers datasets peft faiss-cpu

Configure AWS credentials if not already done. You can run this in a terminal:
```bash
aws configure
# Enter your AWS Access Key ID, Secret Access Key, region (e.g., us-east-1), and output format (json)
```

## 2. Prepare Your Base Model

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a pre-trained model. Llama-2-7b is gated on the Hugging Face Hub and needs a GPU with ample memory;
# swap in a smaller model for quick local tests.
model_name = "meta-llama/Llama-2-7b-hf"  # Or any other model you have access to
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
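
Before loading a model of this size, a back-of-the-envelope estimate of weight memory can be useful. The sketch below is only an approximation: it assumes about 6.74 billion parameters for Llama-2-7B (an assumed figure, not read from the checkpoint) and counts weights only, ignoring activations, optimizer state, and the KV cache. `weight_memory_gib` is a hypothetical helper, not part of any library.

```python
# Rough estimate of the memory needed just to hold model weights.
# Assumption: ~6.74e9 parameters for Llama-2-7B (approximate, for illustration only).
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Return the weight footprint in GiB for a parameter count and dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 3)


if __name__ == "__main__":
    approx_params = 6_740_000_000  # assumed parameter count
    for dtype in ("float32", "float16", "int8"):
        print(f"{dtype}: {weight_memory_gib(approx_params, dtype):.1f} GiB")
```

If the half-precision figure exceeds your GPU memory, a smaller base model is probably the simpler route for local experiments.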

## 3. Set Up PEFT with LoRA for Fine-Tuning

In [None]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Configure LoRA
lora_config = LoraConfig(
    r=8,  # Rank
    lora_alpha=32,  # Alpha parameter for LoRA scaling
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Target attention modules
    lora_dropout=0.05,  # Dropout probability for LoRA layers
    bias="none",  # Bias type
    task_type="CAUSAL_LM"  # Task type
)

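# Prepare model for training
# Note: prepare_model_for_kbit_training targets 8-bit/4-bit quantized models; the model above is
# not quantized, so this call mainly freezes the base weights and can upcast them to fp32
model = prepare_model_for_kbit_training(model)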
model = get_peft_model(model, lora_config)

# Print trainable parameters info
model.print_trainable_parameters()
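
To get a feel for what `r=8` means in terms of trainable parameters, the count can be worked out by hand. This is a sketch under stated assumptions: it takes Llama-2-7B to have a hidden size of 4096 and 32 decoder layers, and treats each of the four targeted projections as a square 4096 x 4096 matrix. The helper names are hypothetical.

```python
# Count the trainable parameters LoRA adds to linear layers.
# Each adapted layer with weight shape (d_out, d_in) gains two low-rank matrices:
# A with shape (r, d_in) and B with shape (d_out, r).


def lora_params_per_layer(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA adapter (matrix A plus matrix B)."""
    return r * d_in + d_out * r


def lora_total_params(hidden_size: int, num_layers: int, modules_per_layer: int, r: int) -> int:
    """Total adapter parameters, assuming square hidden_size x hidden_size projections."""
    per_layer = lora_params_per_layer(hidden_size, hidden_size, r)
    return num_layers * modules_per_layer * per_layer


if __name__ == "__main__":
    # Assumed dimensions: hidden size 4096, 32 decoder layers, 4 target modules.
    total = lora_total_params(hidden_size=4096, num_layers=32, modules_per_layer=4, r=8)
    print(f"Trainable LoRA parameters: {total:,}")
    print(f"LoRA scaling factor (lora_alpha / r): {32 / 8}")
```

Under these assumptions the adapters add 8,388,608 parameters, a small fraction of a percent of the base model. Doubling `r` doubles that count.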

## 4. Prepare Training Data

In [None]:
from datasets import load_dataset

# Load your dataset (example using a public dataset)
# Replace with your dataset or use a sample dataset like this:
dataset = load_dataset("Abirate/english_quotes")

# Display sample data
print(dataset["train"][0])

In [None]:
# Preprocess the dataset
# Llama tokenizers define no padding token, so reuse EOS before padding
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def preprocess_function(examples):
    return tokenizer(
        examples["quote"],  # Adjust field name based on your dataset
        truncation=True,
        max_length=512,
        padding="max_length"
    )

tokenized_dataset = dataset.map(preprocess_function, batched=True)
print(f"Dataset size: {len(tokenized_dataset['train'])} examples")
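
The `truncation`, `max_length`, and `padding` arguments above control the shape of every tokenized example. The following sketch illustrates the idea in plain Python; it is not the tokenizer's implementation. It pads on the right for simplicity, whereas the real tokenizer's padding side depends on its configuration, and `pad_or_truncate` is a hypothetical helper.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long.

    Illustration only: right-pads with pad_id and marks padding with 0
    in the attention mask so the model can ignore those positions.
    """
    ids = list(token_ids)[:max_length]   # truncation: drop tokens past max_length
    mask = [1] * len(ids)                # 1 marks real tokens
    padding = max_length - len(ids)      # number of pad positions to add
    return ids + [pad_id] * padding, mask + [0] * padding


if __name__ == "__main__":
    ids, mask = pad_or_truncate([101, 7592, 2088], max_length=5, pad_id=0)
    print(ids)
    print(mask)
```

Padding every example to 512 tokens is simple but wasteful for short quotes; padding per batch in the data collator is a common alternative.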

## 5. Set Up RAG Components

In [None]:
# Install necessary packages for RAG
!pip install langchain chromadb sentence-transformers

In [None]:
# Import necessary libraries
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import TextLoader, DirectoryLoader
import os

# Create a sample documents directory and files for demonstration
os.makedirs("./sample_documents", exist_ok=True)

# Create a sample document
with open("./sample_documents/sample1.txt", "w") as f:
    f.write("""Parameter-Efficient Fine-Tuning (PEFT) is a technique that allows you to fine-tune large language models 
    with significantly fewer resources. LoRA (Low-Rank Adaptation) is one of the most popular PEFT methods. 
    It works by freezing the original model weights and injecting trainable adapter layers.""")

with open("./sample_documents/sample2.txt", "w") as f:
    f.write("""Retrieval-Augmented Generation (RAG) combines retrieval mechanisms with text generation. 
    It enhances the knowledge of language models by retrieving relevant information from external sources 
    before generating a response. This helps with factual accuracy and reduces hallucinations.""")

# Load your knowledge base documents
loader = DirectoryLoader('./sample_documents/', glob="**/*.txt", loader_cls=TextLoader)
documents = loader.load()

print(f"Loaded {len(documents)} documents")

In [None]:
# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)

print(f"Created {len(chunks)} chunks from {len(documents)} documents")
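
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window chunker. It is an illustration only: `RecursiveCharacterTextSplitter` splits on separators such as paragraph and sentence boundaries rather than at fixed offsets, so its output will differ. `chunk_text` is a hypothetical helper.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1):
        print(repr(chunk))
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.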

In [None]:
# Create embeddings
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Create vector store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory="./chroma_db"
)
vectorstore.persist()

print("Vector store created and persisted")
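
Conceptually, a vector store ranks stored chunks by how close their embeddings are to the query embedding. The sketch below shows that idea with cosine similarity over toy vectors. It is a simplification: the distance metric and index Chroma actually uses depend on its configuration, and the vectors here are made up rather than produced by the embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


if __name__ == "__main__":
    toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]  # made-up 2-D "embeddings"
    print(top_k([1.0, 0.0], toy_docs, k=2))
```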

## 6. Configure SageMaker for Local Training

In [None]:
import sagemaker
from sagemaker.huggingface import HuggingFace
import os

# Create scripts directory if it doesn't exist
os.makedirs("./scripts", exist_ok=True)

## 7. Create Training Script

In [None]:
%%writefile scripts/train.py
import os
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_from_disk

def main():
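    # Load model and tokenizer
    model_name = "meta-llama/Llama-2-7b-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Llama tokenizers define no padding token; the data collator needs one
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token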
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto"
    )
    
    # Configure LoRA
    lora_config = LoraConfig(
        r=8,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM"
    )
    
    # Prepare model for training
    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, lora_config)
    
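    # Load dataset from the "training" channel (SageMaker exposes its path as SM_CHANNEL_TRAINING)
    dataset = load_from_disk(os.environ.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training"))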
    
    # Set up training arguments
    training_args = TrainingArguments(
        output_dir="/opt/ml/model",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=2,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=100,
        save_strategy="epoch",
        warmup_steps=100,
    )
    
    # Create data collator
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False
    )
    
    # Initialize trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset["train"],
        data_collator=data_collator,
    )
    
    # Start training
    trainer.train()
    
    # Save the model
    trainer.save_model()

if __name__ == "__main__":
    main()
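
The batch-size and step settings in the training script interact, so it can help to work out the numbers before launching a job. The sketch below approximates how many optimizer updates a run performs. It is not the `Trainer`'s exact bookkeeping, which can differ slightly between versions when the batch count is not divisible by the accumulation steps, and the dataset size in the example is a hypothetical value.

```python
import math


def effective_batch_size(per_device_batch, grad_accum, num_devices=1):
    """Examples contributing to each optimizer update."""
    return per_device_batch * grad_accum * num_devices


def approx_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Approximate total optimizer updates over the whole run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs


if __name__ == "__main__":
    n = 2000  # hypothetical dataset size
    print("Effective batch size:", effective_batch_size(4, 2))
    print("Approximate optimizer steps:", approx_optimizer_steps(n, 4, 2, epochs=3))
```

With a hypothetical 2,000 examples, the settings above give an effective batch size of 8 and roughly 750 updates, so `warmup_steps=100` would cover about 13% of training and `logging_steps=100` would log only a handful of times.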

## 8. Prepare and Save Your Dataset

In [None]:
# Save your processed dataset to disk
tokenized_dataset.save_to_disk("./processed_dataset")
print("Dataset saved to ./processed_dataset")

## 9. Initialize SageMaker and Start Training

In [None]:
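# Initialize SageMaker session
sagemaker_session = sagemaker.LocalSession()
role = "arn:aws:iam::111111111111:role/service-role/AmazonSageMaker-ExecutionRole"  # Dummy role for local mode

# Define the HuggingFace estimator
# Note: train.py imports peft; if the container image does not include it,
# list it in a requirements.txt inside source_dir so it is installed at startup.
huggingface_estimator = HuggingFace(
    entry_point='train.py',  # Your training script
    source_dir='./scripts',  # Directory containing your scripts
    role=role,
    instance_count=1,
    instance_type='local_gpu',  # Local mode on a GPU host (train.py uses fp16); use 'local' for CPU-only runs
    transformers_version='4.28.1',
    pytorch_version='2.0.0',
    py_version='py310',
    sagemaker_session=sagemaker_session,  # Use the local session created above
)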

In [None]:
# Start the training job
huggingface_estimator.fit({
    'training': 'file://' + os.path.abspath('./processed_dataset')
})
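
SageMaker training jobs package whatever the script writes to `/opt/ml/model` as a `model.tar.gz` archive, while the inference script in this notebook reads the adapter from `./model_output`. One way to bridge the two, assuming you have located the archive (where it ends up depends on your session and output settings, which this notebook does not set), is to unpack it with the standard library. `extract_model_archive` is a hypothetical helper and the path in the usage comment is a placeholder.

```python
import tarfile
from pathlib import Path


def extract_model_archive(archive_path, dest_dir):
    """Unpack a gzip-compressed tar archive into dest_dir and list the top-level entries.

    Only use this on archives you produced yourself; extracting untrusted
    tar files can write outside the destination directory.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
    return sorted(p.name for p in dest.iterdir())


# Example usage (placeholder path, adjust to where your archive was written):
# print(extract_model_archive("/path/to/model.tar.gz", "./model_output"))
```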

## 10. Create RAG Inference Script

In [None]:
%%writefile rag_inference.py
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.prompts import PromptTemplate

# Load base model and tokenizer
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Load fine-tuned LoRA weights
# (expects the extracted contents of the training job's model.tar.gz in ./model_output)
model_path = "./model_output"
model = PeftModel.from_pretrained(base_model, model_path)

# Load vector store
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embedding_model)

# Define RAG prompt template
template = """
Answer the question based on the following context:

Context: {context}

Question: {question}

Answer:
"""
prompt = PromptTemplate(template=template, input_variables=["context", "question"])

def generate_response(question, k=3):
    # Retrieve relevant documents
    docs = vectorstore.similarity_search(question, k=k)
    context = "\n".join([doc.page_content for doc in docs])
    
    # Format prompt with retrieved context
    formatted_prompt = prompt.format(context=context, question=question)
    
    # Generate response (place inputs on the model's device and pass the attention mask too)
    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        do_sample=True
    )
    
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Extract just the answer part (after the prompt)
    answer = response.split("Answer:")[-1].strip()
    
    return answer, docs

# Example usage
if __name__ == "__main__":
    question = "What is the capital of France?"
    answer, retrieved_docs = generate_response(question)
    print(f"Question: {question}")
    print(f"Answer: {answer}")
    print("\nRetrieved Documents:")
    for i, doc in enumerate(retrieved_docs):
        print(f"Document {i+1}:\n{doc.page_content}\n")
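
The prompt assembly and answer extraction in the script can be exercised without loading any model. The sketch below mirrors the template from `rag_inference.py` in plain Python and shows one possible refinement: stripping the echoed prompt by prefix instead of splitting on the last `Answer:`, which would drop text if the model itself emits that marker again. `build_prompt` and `extract_answer` are hypothetical helpers, and the generated text in the example is made up.

```python
TEMPLATE = (
    "Answer the question based on the following context:\n\n"
    "Context: {context}\n\n"
    "Question: {question}\n\n"
    "Answer:"
)


def build_prompt(chunks, question):
    """Join retrieved chunks into one context block and fill the template."""
    return TEMPLATE.format(context="\n".join(chunks), question=question)


def extract_answer(generated_text, prompt):
    """Return only the text the model produced after the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fallback: same heuristic as the script, used when the prompt is not echoed verbatim
    return generated_text.split("Answer:")[-1].strip()


if __name__ == "__main__":
    prompt = build_prompt(["LoRA freezes the base weights."], "What is LoRA?")
    fake_output = prompt + " A PEFT method that trains small adapters."  # made-up generation
    print(extract_answer(fake_output, prompt))
```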

## 11. Test Your Model with RAG

In [None]:
# Import the inference module
from rag_inference import generate_response

# Test with sample questions
questions = [
    "What is PEFT?",
    "How does RAG work?",
    "What are the advantages of LoRA?"
]

for question in questions:
    answer, retrieved_docs = generate_response(question)
    print(f"Q: {question}")
    print(f"A: {answer}")
    print("Retrieved Documents:")
    for i, doc in enumerate(retrieved_docs):
        print(f"Document {i+1}: {doc.page_content[:100]}...")
    print("-" * 50)

## 12. Evaluate Your Model

In [None]:
# Install evaluation packages
!pip install evaluate rouge-score

In [None]:
import evaluate
from rag_inference import generate_response

# Load evaluation dataset (example)
eval_questions = [
    {"question": "What is PEFT?", "reference": "Parameter-Efficient Fine-Tuning (PEFT) is a technique that allows you to fine-tune large language models with significantly fewer resources."},
    {"question": "How does RAG work?", "reference": "Retrieval-Augmented Generation (RAG) combines retrieval mechanisms with text generation. It enhances the knowledge of language models by retrieving relevant information from external sources."},
    # Add more evaluation examples
]

# Initialize metrics
rouge = evaluate.load("rouge")
exact_match = evaluate.load("exact_match")

# Evaluate
predictions = []
references = []

for example in eval_questions:
    question = example["question"]
    reference = example["reference"]
    
    prediction, _ = generate_response(question)
    
    predictions.append(prediction)
    references.append(reference)

# Calculate metrics
rouge_results = rouge.compute(predictions=predictions, references=references)
exact_match_results = exact_match.compute(predictions=predictions, references=references)

print("ROUGE Results:")
for metric, score in rouge_results.items():
    print(f"  {metric}: {score:.4f}")
    
print("\nExact Match Results:")
print(f"  Score: {exact_match_results['exact_match']:.4f}")
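
To interpret the scores above, it may help to see what the two metrics measure. The sketch below is a simplified, whitespace-tokenized version of exact match and ROUGE-1 F1. It is not the `evaluate` implementation, which applies its own tokenization and options, so its numbers will not match exactly.

```python
from collections import Counter


def exact_match(prediction, reference):
    """1.0 if the strings are identical after trimming whitespace, else 0.0."""
    return float(prediction.strip() == reference.strip())


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between prediction and reference (lowercased, split on whitespace)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(rouge1_f1("the cat sat", "the cat sat on the mat"))
    print(exact_match("the cat sat", "the cat sat on the mat"))
```

Exact match is a harsh metric for free-form generation and will usually be zero unless the model reproduces the reference verbatim, so the overlap-based score tends to be the more informative of the two here.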

## 13. Deploy Your Model Locally (Optional)

In [None]:
from sagemaker.huggingface import HuggingFaceModel

# Create a HuggingFace Model (SageMaker typically expects model_data to point to a model.tar.gz archive)
huggingface_model = HuggingFaceModel(
    model_data='file://' + os.path.abspath('./model_output'),
    role=role,
    transformers_version='4.28.1',
    pytorch_version='2.0.0',
    py_version='py310',
)

# Deploy the model locally
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type='local'
)

In [None]:
# Test the deployed model
response = predictor.predict({
    "inputs": "What is the capital of France?"
})
print(response)

In [None]:
# Clean up
predictor.delete_endpoint()

## Additional Tips

1. **Memory Management**: For large models, consider using techniques like:
   - Gradient checkpointing
   - Mixed precision training (fp16)
   - DeepSpeed or other optimization libraries

2. **Data Quality**: The quality of your RAG system heavily depends on your knowledge base. Ensure your documents are:
   - Relevant to your domain
   - Well-structured
   - Properly chunked (not too large, not too small)

3. **Hyperparameter Tuning**: Experiment with different values for:
   - LoRA rank (r)
   - Learning rate
   - Number of training epochs
   - Retrieval parameters (k, chunk size)

4. **Monitoring**: Track training metrics to detect issues early:
   - Loss curves
   - GPU memory usage
   - Training speed

5. **Troubleshooting AWS Local Mode**:
   - Ensure Docker is installed and running
   - Check for sufficient disk space
   - Monitor resource usage during training
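
The local mode checks in the last tip can be partly automated. The sketch below looks for a `docker` executable on the `PATH` and reports free disk space. It does not verify that the Docker daemon is actually running, and the 20 GiB threshold is an arbitrary assumption rather than a documented requirement. `local_mode_preflight` is a hypothetical helper.

```python
import shutil


def local_mode_preflight(path=".", min_free_gib=20.0):
    """Report whether docker is on PATH and how much disk space is free at path."""
    docker_path = shutil.which("docker")  # None if no docker executable is found
    free_gib = shutil.disk_usage(path).free / (1024 ** 3)
    return {
        "docker_on_path": docker_path is not None,
        "free_gib": round(free_gib, 1),
        "enough_disk": free_gib >= min_free_gib,  # threshold is an assumed value
    }


if __name__ == "__main__":
    for key, value in local_mode_preflight().items():
        print(f"{key}: {value}")
```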