# MyContext AI - StarCoder2-3B Fine-tuning

Fine-tune StarCoder2-3B on Google Colab's free T4 GPU using training data from the Intent Dictionary.

**Total Cost: $0** 🎉

## What This Does
- Trains StarCoder2-3B to generate React/TypeScript components from natural language
- Uses LoRA (Low-Rank Adaptation) on an 8-bit base model for memory efficiency
- Pushes the fine-tuned model to the Hugging Face Hub for free inference
- Validates the MyContext AI concept with a code-native model

## Key Advantages
- **Code-native**: Pretrained on source code, including Python, JavaScript, and TypeScript
- **2048-token sequences**: Training examples are truncated to 2,048 tokens, double GPT-2's 1,024-token context (StarCoder2 itself supports 16k)
- **Less repetition**: Code models tend to loop less on code prompts than general text models, though repetition can still occur
- **Memory efficient**: 3B parameters loaded in 8-bit fit in the T4's 16 GB of memory
- **Same cost**: $0 on Google Colab
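
The memory-efficiency point can be sanity-checked with back-of-the-envelope arithmetic: weight memory is roughly the parameter count times the bytes per parameter. The sketch below is only an estimate under that assumption. The helper name `weight_memory_gb` and the round parameter counts are illustrative, and activations, gradients, and optimizer state are not included.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Activations, gradients, and optimizer state are NOT included.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Return the approximate size of the model weights in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Round, illustrative parameter counts (not exact model sizes)
for label, n in [("1B", 1_000_000_000), ("3B", 3_000_000_000)]:
    sizes = ", ".join(
        f"{dtype}: {weight_memory_gb(n, dtype):.1f} GB" for dtype in BYTES_PER_PARAM
    )
    print(f"{label} parameters -> {sizes}")
```

Compare the estimate with the `torch.cuda.memory_allocated()` figure printed after the model loads. The measured value will be somewhat higher because some layers stay in higher precision.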


## Cell 1: Environment Setup


In [None]:
# Check GPU availability
!nvidia-smi

# Install required packages
!pip install -q transformers datasets accelerate peft bitsandbytes huggingface_hub tensorboard

print("✅ Packages installed successfully!")


## Cell 2: Hugging Face Login


In [None]:
from huggingface_hub import notebook_login

# Log in to Hugging Face. Create a token with write access at
# https://huggingface.co/settings/tokens (write access is needed to push the model later)
notebook_login()

print("✅ Login widget shown. Paste your token above to finish logging in.")


## Cell 3: Load StarCoder2-3B Model
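
To give some intuition for what 8-bit loading trades away, here is a simplified sketch of absmax quantization: floats are mapped to integer codes in [-127, 127] using a single shared scale. This is an illustration only. The functions `quantize_absmax` and `dequantize` are hypothetical helpers, and the scheme bitsandbytes actually uses is more elaborate (per-vector scaling plus special handling of outlier values).

```python
def quantize_absmax(values):
    """Map floats to int8 codes in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the int8 codes."""
    return [c * scale for c in codes]


# Toy "weights" chosen for illustration
weights = [0.5, -1.27, 0.03, 1.0]
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)

for original, code, approx in zip(weights, codes, restored):
    print(f"{original:+.4f} -> code {code:+4d} -> {approx:+.4f}")
```

Each weight is stored in one byte instead of two or four, at the cost of a rounding error of at most half the scale per value.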


In [None]:
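from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
import gc

# Release unreferenced Python objects first, then free cached GPU memory
gc.collect()
torch.cuda.empty_cache()

# Use StarCoder2-3B for code generation
model_name = "bigcode/starcoder2-3b"
print(f"🔄 Loading {model_name}...")

# Load in 8-bit for memory efficiency. Passing load_in_8bit directly to
# from_pretrained is deprecated, so use a BitsAndBytesConfig instead.
# StarCoder2 is built into transformers, so trust_remote_code is not needed.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)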

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

print(f"✅ Model loaded: {model.num_parameters():,} parameters")
print(f"💾 GPU memory used: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")


## Cell 4: Load Training Data
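
The loader in this cell expects a JSON Lines file in which every line is an object with a `text` field. Before uploading, it may help to check the file locally. The sketch below is a minimal validator under that assumption. `validate_jsonl` is a hypothetical helper, not part of the `datasets` library.

```python
import json


def validate_jsonl(path, field="text"):
    """Return (valid_count, problems); problems holds (line_number, reason) pairs."""
    valid, problems = 0, []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_no, f"invalid JSON: {exc.msg}"))
                continue
            value = record.get(field) if isinstance(record, dict) else None
            if not isinstance(value, str) or not value.strip():
                problems.append((line_no, f"missing or empty '{field}'"))
            else:
                valid += 1
    return valid, problems
```

Calling `validate_jsonl("starcoder2_training_data.jsonl")` returns the number of usable examples and a list of problem lines to fix before training.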


In [None]:
from datasets import load_dataset

# Load your uploaded file (upload starcoder2_training_data.jsonl first)
dataset = load_dataset("json", data_files="starcoder2_training_data.jsonl")

# Split train/val (fixed seed so the split is reproducible across runs)
train_test_split = dataset["train"].train_test_split(test_size=0.1, seed=42)
train_dataset = train_test_split["train"]
eval_dataset = train_test_split["test"]

print(f"📊 Train examples: {len(train_dataset)}")
print(f"📊 Eval examples: {len(eval_dataset)}")

# Show sample
print("\n📋 Sample training example:")
print(train_dataset[0]["text"][:300] + "...")


## Cell 5: Tokenize Data
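
How sequences are padded determines how many tokens each training step has to process. The sketch below compares padding every example to a fixed length against padding each batch only to its longest example. The sequence lengths are hypothetical, and both helper functions are illustrative rather than part of `transformers`.

```python
def padded_tokens_fixed(lengths, max_length):
    """Tokens processed when every sequence is padded to max_length."""
    return len(lengths) * max_length


def padded_tokens_dynamic(lengths, batch_size):
    """Tokens processed when each batch is padded to its own longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += len(batch) * max(batch)
    return total


# Hypothetical token counts for six examples (already truncated to <= 2048)
lengths = [180, 420, 2048, 350, 600, 240]
fixed = padded_tokens_fixed(lengths, max_length=2048)
dynamic = padded_tokens_dynamic(lengths, batch_size=2)
print(f"fixed: {fixed} tokens, dynamic: {dynamic} tokens")
```

When most examples are much shorter than the 2,048-token cap, per-batch padding can cut the work per epoch substantially.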


In [None]:
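def tokenize_function(examples):
    # No padding here: the data collator pads each batch to its longest
    # example, which avoids filling every sequence out to 2,048 tokens.
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=2048,  # StarCoder2 supports 16k, use 2k for training
    )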

print("🔄 Tokenizing datasets...")
train_dataset = train_dataset.map(tokenize_function, batched=True, remove_columns=["text"])
eval_dataset = eval_dataset.map(tokenize_function, batched=True, remove_columns=["text"])

print("✅ Datasets tokenized successfully")
print(f"Train dataset columns: {train_dataset.column_names}")
print(f"Eval dataset columns: {eval_dataset.column_names}")


## Cell 6: Setup LoRA
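
LoRA freezes each targeted weight matrix and trains two small matrices beside it, so a matrix of shape `d_out x d_in` gains `r * (d_in + d_out)` trainable parameters. The sketch below works through that arithmetic for the four attention projections. The layer dimensions are assumptions chosen for illustration. Read the real values from `model.config` and compare the result with `print_trainable_parameters()`.

```python
def lora_params_per_layer(r, shapes):
    """LoRA adds A (r x d_in) and B (d_out x r) for each targeted matrix."""
    return sum(r * (d_in + d_out) for d_in, d_out in shapes.values())


# ASSUMED dimensions for illustration only; check model.config for real values
hidden = 3072     # assumed hidden size
kv_dim = 256      # assumed key/value projection width (grouped-query attention)
num_layers = 30   # assumed number of transformer layers

shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
}

per_layer = lora_params_per_layer(r=16, shapes=shapes)
total = per_layer * num_layers
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Doubling `r` doubles the adapter size, which is why the rank is the main knob for trading capacity against memory.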


In [None]:
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training

# Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)

# LoRA configuration for StarCoder2
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # StarCoder2 attention
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

print("🔄 Applying LoRA adapters...")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

print(f"💾 GPU memory after LoRA: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")


## Cell 7: Configure Training
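
The batch settings in this cell interact: the effective batch size is the per-device batch size times the gradient accumulation steps, and that in turn fixes how many optimizer steps the run contains. The sketch below estimates those numbers so that `warmup_steps`, `eval_steps`, and `save_steps` can be judged against the total. The dataset size of 900 is a hypothetical figure, `training_schedule` is an illustrative helper, and the `Trainer`'s own rounding may differ by a step or so.

```python
import math


def training_schedule(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Approximate optimizer-step counts implied by the batch settings."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = max(1, math.ceil(num_examples / effective_batch))
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


# 900 training examples is a hypothetical figure, not taken from the notebook
effective, per_epoch, total = training_schedule(
    900, per_device_batch=2, grad_accum=8, epochs=3
)
print(f"effective batch size: {effective}")
print(f"optimizer steps per epoch: {per_epoch}")
print(f"optimizer steps in total: {total}")
```

If the total comes out close to or below 100, the 100-step warmup would cover most of the run and evaluation would happen at most once, so those values are worth scaling down for small datasets.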


In [None]:
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

# Training arguments optimized for StarCoder2
training_args = TrainingArguments(
    output_dir="./mycontext-starcoder2-lora",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,  # Common LoRA setting; adapters tolerate a higher rate than full fine-tuning
    warmup_steps=100,
    logging_steps=25,
    eval_steps=100,
    eval_strategy="steps",
    save_steps=100,
    save_total_limit=2,
    load_best_model_at_end=True,
    fp16=True,
    dataloader_pin_memory=False,
    dataloader_num_workers=0,
    push_to_hub=True,
    hub_model_id="faraja/mycontext-starcoder2",  # Replace "faraja" with your own Hub username
    report_to="tensorboard",
)

# Data collator for language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)

print("✅ Trainer configured successfully")
print(f"Repository: {training_args.hub_model_id}")
