# 🔒 SCA Package Model Training - Google Colab

Fine-tune CodeLlama 7B Instruct to detect known vulnerabilities in package manifests.

**Before starting:**
1. Upload `sca_training_dataset.json` to Google Drive
2. Enable GPU: Runtime → Change runtime type → T4 GPU
3. Run cells one by one with Shift+Enter

## Step 1: Check GPU

In [None]:
import torch
print(f"🖥️  GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"📊 GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"💾 GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
else:
    print("❌ No GPU! Go to Runtime → Change runtime type → Select T4 GPU")

## Step 2: Install Dependencies

In [1]:
%%capture
# %%capture hides all output from this cell, including the final print; remove it to see pip's output
!pip install -q transformers==4.37.0 datasets==2.16.0 peft==0.8.0 bitsandbytes==0.42.0 accelerate==0.26.0 sentencepiece
print("✅ Dependencies installed!")

## Step 3: Mount Google Drive

In [2]:
from google.colab import drive
drive.mount('/content/drive')
print("✅ Google Drive mounted!")


## Step 4: Load Dataset

In [None]:
from datasets import load_dataset
import json

# Change this path to where you uploaded the dataset
DATASET_PATH = "/content/drive/MyDrive/ai_sec/sca_training_dataset.json"

print(f"📂 Loading dataset from: {DATASET_PATH}")

# Load dataset
dataset = load_dataset('json', data_files=DATASET_PATH)

# Split into train/validation
dataset = dataset['train'].train_test_split(test_size=0.1, seed=42)

print(f"\n📊 Dataset Statistics:")
print(f"  Training samples: {len(dataset['train'])}")
print(f"  Validation samples: {len(dataset['test'])}")

# Show sample
print(f"\n📝 Sample training example:")
print(dataset['train'][0]['text'][:500] + "...")
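
The loading cell assumes every record exposes a `text` field, since that is what the sample print and the tokenizer read. A minimal sketch of a sanity check you could run before uploading follows; the sample records and the helper name `find_bad_records` are hypothetical, and the `[INST] ... [/INST]` wrapper is assumed from the test prompt in Step 11.

```python
import json

def find_bad_records(records):
    """Return indices of records lacking a non-empty string 'text' field."""
    bad = []
    for i, rec in enumerate(records):
        text = rec.get("text") if isinstance(rec, dict) else None
        if not isinstance(text, str) or not text.strip():
            bad.append(i)
    return bad

# Hypothetical records, for illustration only
sample_records = [
    {"text": "[INST] Analyze this package.json ... [/INST] (model answer here)"},
    {"text": ""},           # empty text
    {"prompt": "no text"},  # wrong field name
]

# Round-trip through JSON to mimic reading the uploaded file
records = json.loads(json.dumps(sample_records))
print(find_bad_records(records))
```

An empty list means every record can be tokenized by the later cells.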

## Step 5: Load Model (4-bit Quantization)

In [None]:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig
)
import torch

# 4-bit quantization config: weights take roughly 3.5-4 GB instead of ~14 GB in fp16 (~28 GB in fp32)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

print("📥 Loading CodeLlama-7b-Instruct (this takes 2-3 minutes)...")

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    quantization_config=bnb_config,
    device_map="auto",
    # trust_remote_code is not needed: CodeLlama uses the built-in Llama architecture
)

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

print("✅ Model loaded!")
print("💾 Model size in memory: ~4 GB (4-bit weights; ~14 GB in fp16)")
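
The memory figures can be checked with back-of-the-envelope arithmetic: weight memory is roughly the parameter count times the bytes per parameter. The sketch below assumes a nominal 7 billion parameters and decimal gigabytes, and it ignores activations, quantization constants, and CUDA overhead, so real usage will be somewhat higher.

```python
GB = 1e9  # decimal gigabytes

def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, ignoring activations and overhead."""
    return num_params * bits_per_param / 8 / GB

NUM_PARAMS = 7e9  # nominal "7B"; the exact count is slightly lower

for label, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("nf4", 4)]:
    print(f"{label:>5}: {weight_memory_gb(NUM_PARAMS, bits):5.1f} GB")
```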

## Step 6: Configure LoRA (Train only ~0.25% of parameters)

In [None]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# LoRA configuration
lora_config = LoraConfig(
    r=16,  # Rank (lower = fewer trainable parameters, faster training)
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)

# Show trainable parameters
model.print_trainable_parameters()
# Expected with r=16 on the four attention projections: roughly 16.8M trainable params out of ~6.7B (~0.25%)
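
The trainable-parameter count follows from the LoRA shapes: each adapted linear layer gains two low-rank matrices with `r * (in_features + out_features)` parameters in total. Here is a sketch of that arithmetic, assuming the standard Llama 7B dimensions (hidden size 4096, 32 layers), which this notebook does not state.

```python
def lora_params(in_features, out_features, r):
    """LoRA adds two matrices per layer: A (r x in) and B (out x r)."""
    return r * (in_features + out_features)

# Assumed Llama 7B dimensions (not taken from the notebook)
HIDDEN = 4096
NUM_LAYERS = 32
RANK = 16
TARGETS = ["q_proj", "k_proj", "v_proj", "o_proj"]  # each is HIDDEN x HIDDEN

per_module = lora_params(HIDDEN, HIDDEN, RANK)
total = per_module * len(TARGETS) * NUM_LAYERS
print(f"per module: {per_module:,}")
print(f"total trainable: {total:,}")
```

Under the same assumptions, adding the MLP projections (`gate_proj`, `up_proj`, `down_proj`) to `target_modules` would bring the total to roughly 40M.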

## Step 7: Tokenize Dataset

In [None]:
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=2048,
        padding="max_length",
    )

print("🔄 Tokenizing dataset...")

tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset["train"].column_names,
)

print("✅ Dataset tokenized!")
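
Because the cell pads every example to `max_length=2048`, a short example still costs 2048 tokens of compute. The rough sketch below estimates how much of the tokenized data is padding; the token lengths are made up for illustration, and `padding_fraction` is a hypothetical helper.

```python
def padding_fraction(lengths, max_length=2048):
    """Fraction of token slots that are padding when every row is padded to max_length."""
    clipped = [min(n, max_length) for n in lengths]  # truncation=True caps each row
    return 1 - sum(clipped) / (len(clipped) * max_length)

# Hypothetical token counts for five examples
lengths = [512, 1024, 256, 2048, 3000]
print(f"{padding_fraction(lengths):.1%} of token slots are padding")
```

If the fraction is high on the real dataset, a smaller `max_length` or per-batch padding may be worth considering.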

## Step 8: Configure Training

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/ai_sec/models/sca-package",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # Effective batch size = 16
    learning_rate=2e-4,
    fp16=True,
    save_strategy="steps",
    save_steps=50,  # Save every 50 steps
    logging_steps=10,
    warmup_steps=50,
    optim="paged_adamw_8bit",
    gradient_checkpointing=True,
    max_grad_norm=0.3,
    report_to="none",
)

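from transformers import DataCollatorForLanguageModeling

# Causal-LM collator: copies input_ids into labels and masks padding with -100,
# so the model returns a loss (without labels, Trainer raises an error)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],  # only used if evaluation is enabled in TrainingArguments
    data_collator=data_collator,
)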

print("✅ Trainer configured!")
print(f"\n⏱️  Estimated training time: 2-4 hours on T4 GPU")
print(f"💾 Checkpoints will be saved to Google Drive every 50 steps")
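
The batch-size and checkpoint settings translate into step counts with simple arithmetic. This sketch approximates how a single-GPU run is scheduled; the sample count of 1800 is a made-up figure, and the exact step count reported by `Trainer` may differ slightly.

```python
import math

def training_plan(num_samples, per_device_batch, grad_accum, epochs, save_steps):
    """Approximate optimizer-step and checkpoint counts for a single-GPU run."""
    batches_per_epoch = math.ceil(num_samples / per_device_batch)
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": per_device_batch * grad_accum,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "checkpoints": total_steps // save_steps,
    }

# 1800 training samples is a made-up figure for illustration
plan = training_plan(num_samples=1800, per_device_batch=1, grad_accum=16,
                     epochs=3, save_steps=50)
print(plan)
```

With a small dataset, `warmup_steps=50` can cover a large share of the run, so it is worth comparing it against `total_steps`.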

## Step 9: START TRAINING! 🚀

In [None]:
print("🚀 Starting training...")
print("⏰ This will take 2-4 hours")
print("💡 TIP: Keep this tab open - Colab can disconnect idle sessions. Checkpoints on Drive let you resume.")
print("\n" + "="*60)

trainer.train()

print("\n" + "="*60)
print("✅ Training complete!")
print("="*60)
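
If the Colab session disconnects mid-run, the checkpoints written to Drive can be used to continue. This sketch locates the newest checkpoint folder, assuming the `checkpoint-<step>` naming that `Trainer` uses by default; `latest_checkpoint` is a hypothetical helper, not part of any library.

```python
import os
import re

def latest_checkpoint(output_dir):
    """Return the path of the highest-numbered 'checkpoint-N' folder, or None."""
    if not os.path.isdir(output_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path

# Path taken from output_dir in the training arguments
ckpt = latest_checkpoint("/content/drive/MyDrive/ai_sec/models/sca-package")
print(ckpt)  # None when no checkpoint exists yet
```

If a path comes back, re-run Steps 1-8 and call `trainer.train(resume_from_checkpoint=ckpt)` instead of a plain `trainer.train()`. The library's own `transformers.trainer_utils.get_last_checkpoint` performs a similar lookup.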

## Step 10: Save Final Model

In [None]:
output_dir = "/content/drive/MyDrive/ai_sec/models/sca-package-final"

print(f"💾 Saving final model to: {output_dir}")

trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

print("✅ Model saved successfully!")
print(f"\n📁 Model location: {output_dir}")
print("📦 You can now use this model for inference!")

## Step 11: Test the Model! 🧪

In [None]:
print("🧪 Testing the trained model...\n")

# Test input
test_input = """[INST] Analyze this package.json for known vulnerabilities

```json
{
  "name": "my-app",
  "dependencies": {
    "express": "4.16.0",
    "lodash": "4.17.4",
    "axios": "0.18.0"
  }
}
``` [/INST]"""

# Switch to eval mode so LoRA dropout is disabled, then tokenize and generate
model.eval()
inputs = tokenizer(test_input, return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.1,
    do_sample=True,
    top_p=0.95
)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("="*60)
print("🤖 MODEL OUTPUT:")
print("="*60)
print(result)
print("="*60)
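
The decoded string contains the prompt followed by the model's answer, because `generate` returns the input tokens as well. This small sketch keeps only the answer by splitting on the closing `[/INST]` tag; `extract_response` is a hypothetical helper, and the example string is a stand-in for `result`, not real model output.

```python
def extract_response(decoded, end_tag="[/INST]"):
    """Return the text after the last instruction end tag, stripped."""
    _, sep, tail = decoded.rpartition(end_tag)
    return tail.strip() if sep else decoded.strip()

# Made-up decoded string, standing in for `result`
example = "[INST] Analyze this package.json [/INST] (model answer)"
print(extract_response(example))
```

In the notebook this would be called as `extract_response(result)`.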

## 🎉 Congratulations!

You've successfully trained your first AI security model!

**Next Steps:**
1. Train more models: SAST, IaC, Container, etc.
2. Deploy the model using vLLM
3. Integrate into your CI/CD pipeline

**Resources:**
- Model saved in: Google Drive → ai_sec → models → sca-package-final
- Checkpoints: Google Drive → ai_sec → models → sca-package