# 🔒 SCA Package Model Training - Google Colab (GitHub Dataset)

Train CodeLlama 7B to detect package vulnerabilities using 2024-2025 CVE data

**Before starting:**
1. Enable GPU: Runtime → Change runtime type → T4 GPU
2. Dataset will download automatically from GitHub
3. Run cells one by one with Shift+Enter

**Dataset:** 2024-2025 CVEs (~10,000 training examples; the Step 3 download output reports 14.21 MB)

## Step 1: Check GPU

In [None]:
import torch
print(f"🖥️  GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"📊 GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"💾 GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
else:
    print("❌ No GPU! Go to Runtime → Change runtime type → Select T4 GPU")

🖥️  GPU Available: False
❌ No GPU! Go to Runtime → Change runtime type → Select T4 GPU


## Step 2: Install Dependencies

In [None]:
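%%capture
# Silent installation (remove %%capture to see output)
!pip install -q transformers==4.37.0 datasets==2.16.0 peft==0.8.0 accelerate==0.26.0 sentencepiece

# Try to install bitsandbytes - if it fails, we'll skip quantization.
# A failed `!pip` command does not raise a Python exception, so check for
# the installed package instead of wrapping the command in try/except.
!pip install -q bitsandbytes==0.43.0  # Newer version with better CUDA support

import importlib.util
importlib.invalidate_caches()  # pick up packages installed during this session
if importlib.util.find_spec("bitsandbytes") is not None:
    print("✅ Dependencies installed (with bitsandbytes)")
else:
    print("⚠️  bitsandbytes installation failed - will skip quantization")
    print("✅ Dependencies installed (without bitsandbytes)")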

## Step 3: Download Dataset from GitHub

In [None]:
import os

# GitHub repository details
GITHUB_REPO = "abhay2510kr/ai_sec"
DATASET_FILE = "datasets/sca_training_2024_2025.json"
DATASET_URL = f"https://raw.githubusercontent.com/{GITHUB_REPO}/main/{DATASET_FILE}"

print(f"📥 Downloading dataset from GitHub...")
print(f"Repository: {GITHUB_REPO}")
print(f"File: {DATASET_FILE}")

# Download dataset
!wget -q --show-progress {DATASET_URL} -O /content/sca_training_dataset.json

# Verify download
if os.path.exists('/content/sca_training_dataset.json'):
    file_size = os.path.getsize('/content/sca_training_dataset.json') / (1024 * 1024)
    print(f"\n✅ Dataset downloaded successfully!")
    print(f"📊 File size: {file_size:.2f} MB")
else:
    print("\n❌ Download failed! Check if dataset exists in GitHub repo")
    print(f"URL: {DATASET_URL}")

📥 Downloading dataset from GitHub...
Repository: abhay2510kr/ai_sec
File: datasets/sca_training_2024_2025.json

✅ Dataset downloaded successfully!
📊 File size: 14.21 MB
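
Before loading, it can help to confirm that the downloaded file parses and that its records carry the `text` field that Step 4 reads. The following is a minimal sketch, assuming the file is either a single JSON array or JSON Lines; the helper name `count_text_records` is hypothetical and not part of the notebook.

```python
import json

def count_text_records(path):
    """Return (total_records, records_with_text) for a JSON array or JSON Lines file."""
    with open(path, "r", encoding="utf-8") as f:
        raw = f.read().strip()
    if raw.startswith("["):
        records = json.loads(raw)  # the whole file is one JSON array
    else:
        # JSON Lines: one JSON object per non-empty line
        records = [json.loads(line) for line in raw.splitlines() if line.strip()]
    with_text = sum(
        1 for r in records if isinstance(r, dict) and isinstance(r.get("text"), str)
    )
    return len(records), with_text
```

In the notebook this could be called as `count_text_records("/content/sca_training_dataset.json")`; if the two numbers differ, some records lack a usable `text` field.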


## Step 4: Load and Prepare Dataset

In [None]:
from datasets import load_dataset

DATASET_PATH = "/content/sca_training_dataset.json"

print(f"📂 Loading dataset from: {DATASET_PATH}")

# Load dataset
dataset = load_dataset('json', data_files=DATASET_PATH)

# Split into train/validation
dataset = dataset['train'].train_test_split(test_size=0.1, seed=42)

print(f"\n📊 Dataset Statistics:")
print(f"  Training samples: {len(dataset['train'])}")
print(f"  Validation samples: {len(dataset['test'])}")

# Show sample
print(f"\n📝 Sample training example:")
print(dataset['train'][0]['text'][:600] + "...")

## Step 5: Load Model (4-bit Quantization)

In [13]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

print("📥 Loading CodeLlama-7b-Instruct...")

# Try different loading strategies based on available resources
if torch.cuda.is_available():
    print("  Attempting 4-bit quantization...")
    try:
        from transformers import BitsAndBytesConfig
        
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=True,
        )
        
        model = AutoModelForCausalLM.from_pretrained(
            "codellama/CodeLlama-7b-Instruct-hf",
            quantization_config=bnb_config,
            device_map="auto",
            trust_remote_code=True,
        )
        print("  ✅ Model loaded with 4-bit quantization (~4 GB)")
        
    except Exception as e:
        print(f"  ⚠️  4-bit quantization failed: {str(e)[:100]}")
        print("  Trying 8-bit quantization...")
        
        try:
            model = AutoModelForCausalLM.from_pretrained(
                "codellama/CodeLlama-7b-Instruct-hf",
                load_in_8bit=True,
                device_map="auto",
                trust_remote_code=True,
            )
            print("  ✅ Model loaded with 8-bit quantization (~7 GB)")
            
        except Exception as e:
            print(f"  ⚠️  8-bit quantization failed: {str(e)[:100]}")
            print("  Loading without quantization (requires ~14 GB VRAM)...")
            
            model = AutoModelForCausalLM.from_pretrained(
                "codellama/CodeLlama-7b-Instruct-hf",
                torch_dtype=torch.float16,
                device_map="auto",
                trust_remote_code=True,
            )
            print("  ✅ Model loaded in float16 (~14 GB)")
else:
    print("  Loading on CPU...")
    model = AutoModelForCausalLM.from_pretrained(
        "codellama/CodeLlama-7b-Instruct-hf",
        torch_dtype=torch.float32,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
    )
    print("  ✅ Model loaded on CPU")

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

print("\n✅ Model and tokenizer ready!")

📥 Loading CodeLlama-7b-Instruct...
  Loading on CPU...
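
The memory needed for each loading strategy follows from the parameter count times the bytes per parameter. The sketch below estimates weights only, assuming a nominal 7 billion parameters; real usage is higher because of activations, optimizer state, and quantization metadata. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 7e9  # nominal "7B"; an assumed round figure, not the exact count

for label, bits in [("4-bit", 4), ("8-bit", 8), ("float16", 16), ("float32", 32)]:
    print(f"{label:>8}: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

By this estimate only the 4-bit and 8-bit variants leave comfortable headroom on a 16 GB T4, and the float32 CPU path needs roughly 28 GB of system RAM.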


## Step 6: Configure LoRA (Train only ~0.25% of parameters)

In [None]:
from peft import LoraConfig, get_peft_model

# Check if model was quantized
is_quantized = getattr(model, "is_loaded_in_4bit", False) or getattr(model, "is_loaded_in_8bit", False)

# LoRA configuration
lora_config = LoraConfig(
    r=16,  # Rank (lower = less parameters, faster training)
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Prepare model for training only if quantized
if is_quantized:
    try:
        from peft import prepare_model_for_kbit_training
        model = prepare_model_for_kbit_training(model)
        print("✅ Model prepared for k-bit training")
    except ImportError:
        print("⚠️  bitsandbytes not available - proceeding without k-bit preparation")
else:
    print("ℹ️  Model not quantized - proceeding with standard LoRA")

model = get_peft_model(model, lora_config)

# Show trainable parameters
model.print_trainable_parameters()
# Expected with these four target modules: ~16.8M trainable / ~6.7B total (~0.25%)
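
The trainable-parameter count can be checked by hand: LoRA adds two low-rank matrices per adapted weight, for `r * (in_features + out_features)` extra parameters each. The sketch below assumes the 7B Llama architecture's hidden size of 4096 and 32 decoder layers; `lora_param_count` is a hypothetical helper, not a PEFT function.

```python
def lora_param_count(r, layer_shapes, num_layers):
    """LoRA adds A (r x in) and B (out x r) per adapted matrix: r * (in + out) parameters."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in layer_shapes)
    return per_layer * num_layers

HIDDEN = 4096     # assumed hidden size of the 7B architecture
NUM_LAYERS = 32   # assumed number of decoder layers

# q_proj, k_proj, v_proj, o_proj are each HIDDEN x HIDDEN in this model size
attention_only = [(HIDDEN, HIDDEN)] * 4
print(lora_param_count(16, attention_only, NUM_LAYERS))  # 16777216
```

That is about 16.8M parameters, or roughly 0.25% of the base model. Adding the MLP projections to `target_modules` would raise the count to around 40M.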

## Step 7: Tokenize Dataset

In [None]:
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=2048,
        padding="max_length",
    )

print("🔄 Tokenizing dataset...")

tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset["train"].column_names,
)

print("✅ Dataset tokenized!")
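
The tokenizer output contains `input_ids` and `attention_mask` but no `labels`, which a causal-LM loss needs. One common approach, and what `DataCollatorForLanguageModeling(mlm=False)` does, is to copy `input_ids` into `labels` and set padding positions to `-100` so the loss skips them. A plain-Python sketch of that masking, using a hypothetical helper and made-up token ids:

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default

def make_labels(input_ids, pad_token_id):
    """Copy input_ids into labels, replacing padding positions with IGNORE_INDEX."""
    return [tok if tok != pad_token_id else IGNORE_INDEX for tok in input_ids]

# Toy sequence padded to length 6; the ids are made up for illustration.
print(make_labels([1, 523, 88, 2, 2, 2], pad_token_id=2))
# [1, 523, 88, -100, -100, -100]
```

Because this notebook sets `pad_token` to `eos_token`, masking by pad id also masks the real end-of-sequence token, as the toy output shows.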

## Step 8: Configure Training

In [None]:
from transformers import TrainingArguments, Trainer

# Check if GPU is available
use_fp16 = torch.cuda.is_available()
use_8bit = torch.cuda.is_available()

training_args = TrainingArguments(
    output_dir="/content/sca-package-checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # Effective batch size = 16
    learning_rate=2e-4,
    fp16=use_fp16,  # Only use fp16 if GPU available
    save_strategy="steps",
    save_steps=100,
    logging_steps=10,
    warmup_steps=50,
    optim="paged_adamw_8bit" if use_8bit else "adamw_torch",
    gradient_checkpointing=True,
    max_grad_norm=0.3,
    report_to="none",
    evaluation_strategy="steps",  # name used by the pinned transformers 4.37.0; eval_strategy came in later releases
    eval_steps=100,
)

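# Causal-LM fine-tuning needs labels. This collator copies input_ids into
# labels and masks padding, so the Trainer can compute a loss.
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    data_collator=data_collator,
)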

print("✅ Trainer configured!")
if use_fp16:
    print(f"\n⏱️  Estimated training time: 4-6 hours on GPU")
else:
    print(f"\n⏱️  Estimated training time: 24+ hours on CPU")
print(f"💾 Checkpoints will be saved to: /content/sca-package-checkpoints")
print(f"⚠️  Remember to download the model before session ends!")
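
To see how the batch settings translate into optimizer steps and checkpoints, the arithmetic can be sketched as follows. This assumes about 9,000 training examples (the stated ~10,000 minus the 10% validation split) and a single GPU; `approx_optimizer_steps` is a hypothetical helper, and the Trainer's own rounding may differ slightly.

```python
import math

def approx_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Rough count of optimizer updates over the whole run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs

train_examples = 9000  # assumption: ~10,000 examples minus the 10% validation split
steps = approx_optimizer_steps(train_examples, per_device_batch=1, grad_accum=16, epochs=3)
print(steps, "optimizer steps,", steps // 100, "checkpoints at save_steps=100")
```

With these assumed numbers the 50 warmup steps cover about 3% of the run, and evaluation runs at the same 100-step interval as checkpointing.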

## Step 9: START TRAINING! 🚀

In [None]:
print("🚀 Starting training...")
print("⏰ This will take roughly 4-6 hours on a T4 GPU")
print("💡 TIP: Keep this tab open - Colab can disconnect idle sessions and stop training")
print("\n" + "="*60)

trainer.train()

print("\n" + "="*60)
print("✅ Training complete!")
print("="*60)

## Step 10: Save Final Model

In [None]:
output_dir = "/content/sca-package-final"

print(f"💾 Saving final model to: {output_dir}")

trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

print("✅ Model saved successfully!")
print(f"\n📁 Model location: {output_dir}")
print("\n⬇️  IMPORTANT: Download the model now!")
print("   Run the next cell to create a downloadable zip file")

## Step 11: Download Model (IMPORTANT!)

In [None]:
import os
import shutil
from google.colab import files

# Create zip file
print("📦 Creating zip file...")
shutil.make_archive('/content/sca-package-final', 'zip', '/content/sca-package-final')

print("✅ Zip file created!")
print(f"📊 Size: {os.path.getsize('/content/sca-package-final.zip') / (1024*1024):.2f} MB")

# Download
print("\n⬇️  Starting download...")
files.download('/content/sca-package-final.zip')

print("\n✅ Download complete!")
print("💡 Save this file - you can use it to run inference later!")

## Step 12: Test the Model! 🧪

In [None]:
print("🧪 Testing the trained model...\n")

# Test input
test_input = """[INST] Analyze this package.json for known vulnerabilities

```json
{
  "name": "my-app",
  "dependencies": {
    "express": "4.16.0",
    "lodash": "4.17.4",
    "axios": "0.18.0"
  }
}
``` [/INST]"""

# Tokenize and generate (use the model's device so this also works without a GPU)
model.eval()
inputs = tokenizer(test_input, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.1,
    do_sample=True,
    top_p=0.95
)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("="*60)
print("🤖 MODEL OUTPUT:")
print("="*60)
print(result)
print("="*60)
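
For a decoder-only model, `generate` returns the prompt tokens followed by the completion, so the decoded `result` repeats the `[INST] ... [/INST]` prompt. A small sketch for keeping only the model's answer, assuming the prompt ends with `[/INST]` as in the test input; `extract_response` is a hypothetical helper.

```python
def extract_response(decoded, end_tag="[/INST]"):
    """Return the text after the last end_tag, or the whole string if the tag is absent."""
    _, found, tail = decoded.rpartition(end_tag)
    return tail.strip() if found else decoded.strip()
```

In the notebook this would be called as `extract_response(result)` before printing or parsing the output.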

## 🎉 Congratulations!

You've successfully trained your SCA model on 2024-2025 CVE data!

**What you have:**
- ✅ Trained LoRA adapter and tokenizer (downloaded as zip)
- ✅ Can detect package vulnerabilities
- ✅ Ready for production use

**Next Steps:**
1. Train more models: SAST, IaC, Container, etc.
2. Deploy the model using vLLM or Ollama
3. Integrate into your CI/CD pipeline

**Model Info:**
- Base: CodeLlama-7b-Instruct
- Training data: 2024-2025 CVEs (~10K examples)
- Training time: ~4-6 hours
- Model size: ~4 GB base model (4-bit quantized), plus the saved LoRA adapter