# Context-Aware Documentation Generator - Training Notebook

This notebook demonstrates how to fine-tune the Phi-3 model for code documentation generation using QLoRA.

## Setup and Installation

In [None]:
# Install required packages (run this in Colab)
!pip install -q transformers datasets peft bitsandbytes accelerate
!pip install -q sentence-transformers faiss-cpu
!pip install -q tree-sitter tree-sitter-python tree-sitter-javascript
!pip install -q gitpython loguru tqdm

In [None]:
import sys
import os
import torch
import json
from pathlib import Path
import pandas as pd
from datasets import Dataset
from transformers import TrainingArguments, Trainer
import logging

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Add project root to path (adjust if needed)
project_root = Path.cwd()
if project_root.name == 'notebooks':
    project_root = project_root.parent
sys.path.append(str(project_root))

# Import our modules
from src.llm import create_documentation_generator
from src.parser import create_parser
from src.rag import create_rag_system

## Check GPU Availability

In [None]:
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
else:
    print("Running on CPU")

## Create Training Dataset

We'll create a small, hand-written dataset of code-docstring pairs to exercise the training pipeline end to end.

In [None]:
# Sample training data - in practice, you'd collect real examples
training_examples = [
    {
        "code": "def add_numbers(a, b):\n    return a + b",
        "language": "python",
        "docstring": '"""\nAdd two numbers together.\n\nArgs:\n    a (int|float): First number\n    b (int|float): Second number\n\nReturns:\n    int|float: Sum of the two numbers\n"""'
    },
    {
        "code": "def factorial(n):\n    if n <= 1:\n        return 1\n    return n * factorial(n - 1)",
        "language": "python",
        "docstring": '"""\nCalculate the factorial of a number using recursion.\n\nArgs:\n    n (int): Non-negative integer to calculate factorial for\n\nReturns:\n    int: Factorial of n\n\nRaises:\n    RecursionError: If n is too large and exceeds recursion limit\n"""'
    },
    {
        "code": "class Rectangle:\n    def __init__(self, width, height):\n        self.width = width\n        self.height = height\n\n    def area(self):\n        return self.width * self.height",
        "language": "python",
        "docstring": '"""\nA rectangle class for geometric calculations.\n\nAttributes:\n    width (float): Width of the rectangle\n    height (float): Height of the rectangle\n"""'
    },
    {
        "code": "function validateEmail(email) {\n    const re = /^[^\\s@]+@[^\\s@]+\\.[^\\s@]+$/;\n    return re.test(email);\n}",
        "language": "javascript",
        "docstring": '/**\n * Validate an email address using regex.\n * \n * @param {string} email - The email address to validate\n * @returns {boolean} True if email is valid, false otherwise\n */'
    },
    {
        "code": "public int binarySearch(int[] arr, int target) {\n    int left = 0, right = arr.length - 1;\n    while (left <= right) {\n        int mid = left + (right - left) / 2;\n        if (arr[mid] == target) return mid;\n        if (arr[mid] < target) left = mid + 1;\n        else right = mid - 1;\n    }\n    return -1;\n}",
        "language": "java",
        "docstring": '/**\n * Perform binary search on a sorted array.\n * \n * @param arr The sorted array to search in\n * @param target The value to search for\n * @return Index of target if found, -1 otherwise\n */'
    }
]

print(f"Created {len(training_examples)} training examples")
print("\nExample:")
print(f"Code: {training_examples[0]['code']}")
print(f"Docstring: {training_examples[0]['docstring']}")
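
Real examples are usually kept on disk rather than inlined in a cell. The sketch below assumes a JSON Lines file with one object per line and the same `code`, `language`, and `docstring` keys used above. The `load_examples` helper and the file layout are illustrative assumptions, not part of the project.

```python
import json

REQUIRED_KEYS = ("code", "language", "docstring")

def load_examples(path):
    """Read code-docstring pairs from a JSON Lines file, skipping bad rows."""
    examples = []
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, 1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                print(f"Skipping line {line_no}: invalid JSON")
                continue
            if not isinstance(record, dict):
                print(f"Skipping line {line_no}: expected a JSON object")
                continue
            # A usable row has every required key with a non-empty value.
            bad = [k for k in REQUIRED_KEYS if not str(record.get(k, "")).strip()]
            if bad:
                print(f"Skipping line {line_no}: missing or empty {bad}")
                continue
            examples.append({k: record[k] for k in REQUIRED_KEYS})
    return examples
```

The returned list has the same shape as `training_examples`, so it could be passed through the formatting step unchanged.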

## Initialize Model and Prepare for Training

In [None]:
# Initialize documentation generator
print("Initializing documentation generator...")
doc_generator = create_documentation_generator()

# Prepare model for training
print("Preparing model for LoRA training...")
doc_generator.prepare_for_training()

print("Model prepared successfully!")
print(f"Model device: {doc_generator.device}")
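
The internals of `prepare_for_training()` are not shown in this notebook. In general, LoRA freezes the base weights and trains two small low-rank matrices per adapted layer, and QLoRA additionally keeps the frozen base weights in 4-bit precision. The arithmetic below is a rough sketch of why so few parameters end up trainable. The matrix size and rank are illustrative assumptions, not values read from the project's configuration.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix."""
    # LoRA learns an update B @ A with A of shape (rank, d_in)
    # and B of shape (d_out, rank); the original matrix stays frozen.
    return rank * d_in + d_out * rank

# Illustrative numbers only: a square 3072 x 3072 projection and rank 16.
d_model, rank = 3072, 16
full = d_model * d_model
added = lora_param_count(d_model, d_model, rank)
print(f"Full matrix:  {full:,} parameters")
print(f"LoRA adapter: {added:,} parameters ({added / full:.2%} of the matrix)")
```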

## Format Training Data

In [None]:
# Format training examples for the model
formatted_examples = []

for example in training_examples:
    # Create prompt
    prompt = f"""Generate a docstring for this {example['language']} code:

```{example['language']}
{example['code']}
```

Docstring:"""
    
    # Format as training example
    formatted_example = {
        "input": prompt,
        "output": example['docstring']
    }
    formatted_examples.append(formatted_example)

# Convert to HuggingFace dataset format
training_data = doc_generator.create_training_dataset(formatted_examples)

print(f"Formatted {len(training_data)} training examples")
print("\nFirst formatted example:")
print(training_data[0]['text'][:200] + "...")
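
`create_training_dataset` is defined in the project's `src.llm` module and is not shown here. One plausible layout, sketched below under that caveat, joins each prompt and target into a single `text` string and appends an end-of-sequence marker so the model learns where the docstring stops. `to_training_text` is a hypothetical helper, and `"<EOS>"` is a placeholder; in practice you would pass the tokenizer's own `eos_token`.

```python
def to_training_text(example, eos_token):
    """Join prompt and target into one string for causal LM training."""
    # The trailing end-of-sequence marker signals the end of the docstring.
    return {"text": f"{example['input']}\n{example['output']}{eos_token}"}

FENCE = "`" * 3  # built programmatically to avoid nesting code fences here
sample = {
    "input": (
        "Generate a docstring for this python code:\n\n"
        f"{FENCE}python\ndef f():\n    pass\n{FENCE}\n\nDocstring:"
    ),
    "output": '"""Do nothing."""',
}
print(to_training_text(sample, eos_token="<EOS>")["text"])
```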

## Setup Training Configuration

In [None]:
# Create dataset
dataset = Dataset.from_list(training_data)

# Training arguments
training_args = TrainingArguments(
    output_dir="./phi3-docstring-lora",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    warmup_steps=10,
    learning_rate=2e-4,
    fp16=torch.cuda.is_available(),
    logging_steps=10,
    save_strategy="epoch",
    remove_unused_columns=False,
    dataloader_pin_memory=False,
)

print("Training configuration:")
print(f"- Epochs: {training_args.num_train_epochs}")
print(f"- Batch size: {training_args.per_device_train_batch_size}")
print(f"- Learning rate: {training_args.learning_rate}")
print(f"- FP16: {training_args.fp16}")
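
It helps to check how many optimizer updates these settings actually produce. The sketch below is an approximation (the `Trainer` applies its own rounding, which can differ slightly between versions), and `optimizer_steps` is an illustrative helper rather than library API.

```python
import math

def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Approximate the effective batch size and number of optimizer updates."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs

# The five demo examples with the settings used in this notebook.
effective_batch, total_steps = optimizer_steps(
    num_examples=5, per_device_batch=1, grad_accum=4, epochs=3
)
print(f"Effective batch size: {effective_batch}")
print(f"Approximate optimizer steps: {total_steps}")
if total_steps < 10:
    print("warmup_steps=10 and logging_steps=10 both exceed the total step count")
```

With only five examples the run finishes before the 10-step warmup ends, so the learning rate never reaches its configured peak and no intermediate loss is logged. On a realistically sized dataset these settings behave as intended.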

## Data Collator for Training

In [None]:
from transformers import DataCollatorForLanguageModeling

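def tokenize_function(examples):
    """Tokenize the training examples."""
    # Return plain lists: the data collator pads each batch and builds tensors.
    return doc_generator.tokenizer(
        examples['text'],
        truncation=True,
        max_length=512,
    )

# Tokenize dataset and drop the raw text column. Because remove_unused_columns=False,
# a leftover string column would reach the collator, which cannot convert it to tensors.
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset.column_names,
)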

# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=doc_generator.tokenizer,
    mlm=False,  # Causal language modeling
)

print("Dataset tokenized and data collator prepared")
print(f"Tokenized dataset size: {len(tokenized_dataset)}")
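
With `mlm=False`, `DataCollatorForLanguageModeling` pads each batch, copies `input_ids` into `labels`, and replaces padding positions with `-100` so they do not contribute to the loss. The pure-Python sketch below mimics that behavior on toy token IDs. `collate_causal_lm` is an illustrative helper, not library API, and it assumes right padding.

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def collate_causal_lm(batch_input_ids, pad_token_id):
    """Pad sequences to equal length and derive causal LM labels."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch_input_ids:
        pad = max_len - len(ids)
        padded = ids + [pad_token_id] * pad
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * pad)
        # Labels copy the inputs; every pad-token position is masked out.
        labels.append([tok if tok != pad_token_id else IGNORE_INDEX for tok in padded])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

batch = collate_causal_lm([[5, 6, 7], [8, 9]], pad_token_id=0)
print(batch["labels"])
```

Because masking is keyed on the pad token ID, a tokenizer that reuses its end-of-sequence token for padding would also have real end-of-sequence positions masked, which is worth checking before a longer run.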

## Initialize Trainer

In [None]:
# Initialize trainer
trainer = Trainer(
    model=doc_generator.model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
    tokenizer=doc_generator.tokenizer,  # newer transformers releases name this argument `processing_class`
)

print("Trainer initialized successfully!")

## Start Training

In [None]:
# Note: This is a minimal example with very few samples
# In practice, you'd want hundreds or thousands of examples
print("Starting training...")
print("⚠️ This is a demonstration with minimal data. For production, use much more training data.")

try:
    trainer.train()
    print("✅ Training completed successfully!")
except Exception as e:
    print(f"❌ Training error: {e}")
    print("This might be due to insufficient GPU memory or other configuration issues.")

## Save the Fine-tuned Model

In [None]:
# Save the model
output_dir = "./fine_tuned_phi3_docstring"
try:
    doc_generator.save_model(output_dir)
    print(f"✅ Model saved to {output_dir}")
except Exception as e:
    print(f"❌ Error saving model: {e}")

## Test the Fine-tuned Model

In [None]:
# Test with a new code example
test_code = """
def bubble_sort(arr):
    n = len(arr)
    for i in range(n):
        for j in range(0, n - i - 1):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
    return arr
"""

print("Testing the model with new code:")
print("Code:")
print(test_code)

# Generate docstring
try:
    docstring = doc_generator.generate_docstring(
        code=test_code.strip(),
        language="python",
        context="",
        style="google"
    )
    
    print("\nGenerated Docstring:")
    print(docstring)
    
except Exception as e:
    print(f"❌ Error generating docstring: {e}")

## Training Tips and Best Practices

### For Better Results:

1. **Larger Dataset**: Use thousands of high-quality code-docstring pairs
2. **Data Quality**: Ensure docstrings follow consistent style guidelines
3. **Language Diversity**: Include examples from multiple programming languages
4. **Code Complexity**: Cover the full range, from simple functions to complex classes and modules
5. **Domain Specificity**: Fine-tune on code from your own domain (web dev, ML, etc.)

### Hardware Requirements:
- **GPU**: NVIDIA GPU with at least 8GB VRAM (16GB+ recommended)
- **RAM**: 16GB+ system RAM
- **Storage**: 10GB+ free space for models and datasets
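
To see where the VRAM figure comes from, a back-of-the-envelope estimate of weight memory at different precisions can help. The parameter count below is an assumption (roughly the size of a Phi-3-mini-class model). Activations, adapter gradients, optimizer state, and framework overhead come on top of these numbers.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to store the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

NUM_PARAMS = 3.8e9  # assumption: a Phi-3-mini-sized model

for label, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under this assumption, 4-bit weights take about a quarter of the fp16 footprint, which is what makes fine-tuning on a single 8 GB card plausible.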

### Data Collection Sources:
- Open source repositories with good documentation
- Code documentation datasets (CodeSearchNet, etc.)
- Manual curation of high-quality examples
- Synthetic data generation with GPT-4 or similar models
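
For the first source, Python's standard `ast` module can pull documented functions out of existing code. The following is a minimal sketch, assuming Python 3.9+ for `ast.unparse`. `extract_pairs` is a hypothetical helper, and it covers Python sources only.

```python
import ast

def extract_pairs(source):
    """Yield code-docstring pairs for every documented function in `source`."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        docstring = ast.get_docstring(node)
        if not docstring:
            continue  # undocumented functions make no training pair
        # Drop the docstring statement so the model input does not leak the target.
        node.body = node.body[1:] or [ast.Pass()]
        yield {
            "code": ast.unparse(node),
            "language": "python",
            "docstring": docstring,
        }

sample_source = '''
def add(a, b):
    """Add two numbers."""
    return a + b

def undocumented():
    pass
'''
for pair in extract_pairs(sample_source):
    print(pair["code"])
    print(pair["docstring"])
```

Note that `ast.unparse` normalizes formatting and drops comments, so the extracted code may differ cosmetically from the original file.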

## Model Evaluation

In [None]:
# Qualitative spot check on unseen examples (no reference docstrings, so nothing is scored)
test_examples = [
    {
        "code": "def max_element(lst):\n    return max(lst) if lst else None",
        "language": "python"
    },
    {
        "code": "function isEven(num) {\n    return num % 2 === 0;\n}",
        "language": "javascript"
    }
]

print("Evaluating model on test examples:")
print("=" * 50)

for i, example in enumerate(test_examples, 1):
    print(f"\nTest Example {i}:")
    print(f"Language: {example['language']}")
    print(f"Code: {example['code']}")
    
    try:
        docstring = doc_generator.generate_docstring(
            code=example['code'],
            language=example['language'],
            context="",
            style="google"
        )
        print(f"Generated Docstring: {docstring}")
    except Exception as e:
        print(f"Error: {e}")
    
    print("-" * 30)

## Next Steps

1. **Collect More Data**: Gather a larger, more diverse dataset
2. **Hyperparameter Tuning**: Experiment with different learning rates, batch sizes, etc.
3. **Evaluation Metrics**: Implement BLEU, ROUGE, or other text generation metrics
4. **A/B Testing**: Compare generated documentation quality with human-written docs
5. **Domain Adaptation**: Fine-tune for specific programming domains or frameworks
6. **Integration**: Deploy the fine-tuned model in the main application
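
As a starting point for step 3, the sketch below computes a simplified ROUGE-1-style F1 score from unigram overlap. It is an approximation (no stemming and only basic tokenization), and `unigram_f1` is an illustrative helper. For reportable numbers, an established metrics library is the safer choice.

```python
import re
from collections import Counter

def unigram_f1(candidate, reference):
    """ROUGE-1-style F1: unigram overlap between candidate and reference."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = unigram_f1("Add two numbers", "Add two numbers together.")
print(f"Unigram F1: {score:.3f}")
```

Scoring generated docstrings against held-out human-written ones in this way gives a number to track across training runs, though overlap metrics say little about factual correctness.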