# Airflow DAG Generation - Model Inference on GPU

This notebook runs inference on the test dataset to compare two models:
- **Base Model**: Qwen/Qwen2.5-Coder-1.5B-Instruct
- **Fine-tuned Model**: andrea-t94/qwen2.5-1.5b-airflow-instruct (LoRA adapter)

**Dataset**: andrea-t94/airflow-dag-dataset (test split)

## Setup Instructions
1. Runtime → Change runtime type → T4 GPU (or better)
2. Run all cells in order
3. Models and dataset will be cached for re-runs
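
As a rough sanity check on the GPU requirement in step 1, the weights of a 1.5B-parameter model in 16-bit precision occupy about 3 GB, which fits comfortably on a 16 GB T4. The sketch below is a back-of-the-envelope estimate only: the parameter count is the nominal figure taken from the model name (an assumption), and `estimate_weight_gb` is a hypothetical helper. Activations and the KV cache add to this during generation.

```python
def estimate_weight_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory taken by the model weights alone, in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


NOMINAL_PARAMS = 1.5e9  # assumption: nominal size read off the model name

fp16_gb = estimate_weight_gb(NOMINAL_PARAMS, 16)  # 16-bit weights
int4_gb = estimate_weight_gb(NOMINAL_PARAMS, 4)   # 4-bit quantized weights

print(f"16-bit weights: ~{fp16_gb:.2f} GB")
print(f"4-bit weights:  ~{int4_gb:.2f} GB")
```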

## 1. GPU Check and Environment Setup

In [1]:
# Check GPU availability
!nvidia-smi

import torch
print(f"\nPyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("⚠️ WARNING: No GPU detected. Please enable GPU in Runtime → Change runtime type")

Wed Dec 17 14:16:14 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   31C    P0             45W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

## 2. Install Dependencies (Unsloth for Fast Inference)

In [2]:
# Install Unsloth for optimized inference
!pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
# Quote the requirement so the shell does not treat ">" as output redirection
!pip install -q "datasets>=3.2.0"

print("✅ Unsloth and dependencies installed successfully")

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done

## 3. Load Test Dataset from HuggingFace

In [3]:
from datasets import load_dataset
import json

# Load test split (will be cached automatically)
print("Loading test dataset from HuggingFace...")
dataset = load_dataset(
    "andrea-t94/airflow-dag-dataset",
    split="test",
    download_mode="reuse_cache_if_exists"  # Use cached version if available
)

print(f"\n✅ Loaded {len(dataset)} test examples")

# Show dataset statistics
airflow_count = sum(1 for x in dataset if x.get('source') == 'airflow')
magpie_count = sum(1 for x in dataset if x.get('source') == 'magpie')

print(f"\nDataset composition:")
print(f"  - Airflow examples: {airflow_count} ({airflow_count/len(dataset)*100:.1f}%)")
print(f"  - Magpie examples: {magpie_count} ({magpie_count/len(dataset)*100:.1f}%)")

# Preview first example
print(f"\nFirst example:")
print(json.dumps(dataset[0]['messages'][:2], indent=2))  # Show system + user message

Loading test dataset from HuggingFace...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/10.4M [00:00<?, ?B/s]

data/test-00000-of-00001.parquet:   0%|          | 0.00/560k [00:00<?, ?B/s]

data/eval-00000-of-00001.parquet:   0%|          | 0.00/574k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7414 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/412 [00:00<?, ? examples/s]

Generating eval split:   0%|          | 0/412 [00:00<?, ? examples/s]


✅ Loaded 412 test examples

Dataset composition:
  - Airflow examples: 342 (83.0%)
  - Magpie examples: 70 (17.0%)

First example:
[
  {
    "role": "system",
    "content": "You are an expert Apache Airflow developer. Generate complete, valid Airflow DAGs based on given requirements."
  },
  {
    "role": "user",
    "content": "Create a data pipeline that demonstrates loading sample product data into a Snowflake table and validating the data load. The pipeline should insert 12 product records and then verify the total number of rows matches the expected count.\n\nAirflow Version: 3.0.1"
  }
]
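
Each record stores a chat as a list of role/content messages. The generation step later in the notebook passes `messages[:-1]` to the chat template, which suggests the final message holds the reference assistant answer. The sketch below uses made-up records (not rows from the dataset) to show that split and to count sources in a single pass with `collections.Counter`; `split_prompt_and_reference` is a hypothetical helper.

```python
from collections import Counter


def split_prompt_and_reference(messages):
    """Split a chat into (prompt messages, reference answer).

    Assumes the last message is the assistant's reference answer;
    returns (all messages, None) when it is not.
    """
    if messages and messages[-1]["role"] == "assistant":
        return messages[:-1], messages[-1]["content"]
    return list(messages), None


# Made-up records that only mirror the structure shown above
toy_records = [
    {"source": "airflow", "messages": [
        {"role": "system", "content": "You are an expert Apache Airflow developer."},
        {"role": "user", "content": "Create a simple DAG."},
        {"role": "assistant", "content": "from airflow import DAG"},
    ]},
    {"source": "airflow", "messages": [
        {"role": "user", "content": "Create another DAG."},
        {"role": "assistant", "content": "from airflow import DAG"},
    ]},
    {"source": "magpie", "messages": [
        {"role": "user", "content": "Explain sensors."},
        {"role": "assistant", "content": "Sensors wait for a condition."},
    ]},
]

# One pass over the records instead of one pass per source
source_counts = Counter(record.get("source") for record in toy_records)
prompt, reference = split_prompt_and_reference(toy_records[0]["messages"])

print(dict(source_counts))
print(len(prompt), "prompt messages; reference:", reference)
```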


## 4. Load Models

We'll load both:
1. Base model: Qwen/Qwen2.5-Coder-1.5B-Instruct
2. Fine-tuned model: Base + LoRA adapter from andrea-t94/qwen2.5-1.5b-airflow-instruct
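
For intuition on what "base + LoRA adapter" means: LoRA keeps the base weight matrix `W` frozen and learns a low-rank update, so the effective weight is `W + (alpha / r) * B @ A`. The toy sketch below uses tiny hand-written matrices in plain Python; the shapes, rank, and `alpha` are illustrative assumptions, not the configuration of the adapter loaded in this notebook.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(w, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the merged weight matrix."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Illustrative values only (rank 1, 2x2 weight)
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]    # r x in  = 1 x 2
B = [[0.5], [0.0]]  # out x r = 2 x 1

merged = apply_lora(W, A, B, alpha=2, rank=1)
print(merged)
```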

In [5]:
from unsloth import FastLanguageModel
import torch

# Model identifiers
BASE_MODEL_ID = "Qwen/Qwen2.5-Coder-1.5B-Instruct"
FINETUNED_ADAPTER_ID = "andrea-t94/qwen2.5-1.5b-airflow-instruct"

# Unsloth configuration
max_seq_length = 4096
dtype = None            # Auto-detect (float16 on T4, bfloat16 on GPUs that support it)
load_in_4bit = False    # False for speed/precision, True for memory savings

print("Loading models with Unsloth...")
print("="*60)

# ---------------------------------------------------------
# 1. Loading the BASE model (Pure Base)
# ---------------------------------------------------------
print("\n1. Loading BASE model...")
base_model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = BASE_MODEL_ID,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(base_model)
print(f"✅ Base model loaded: {BASE_MODEL_ID}")

# ---------------------------------------------------------
# 2. Loading the FINE-TUNED model (Base + Adapter)
# ---------------------------------------------------------
print("\n2. Loading FINE-TUNED model...")
# Pass the adapter ID directly; Unsloth resolves the base model from the adapter
finetuned_model, _ = FastLanguageModel.from_pretrained(
    model_name = FINETUNED_ADAPTER_ID, # <--- Put Adapter ID here
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# Enable inference mode
FastLanguageModel.for_inference(finetuned_model)
print(f"✅ Fine-tuned model loaded: {FINETUNED_ADAPTER_ID}")

print("\n" + "="*60)
print("✅ All models loaded.")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Loading models with Unsloth...

1. Loading BASE model...
==((====))==  Unsloth 2025.12.6: Fast Qwen2 patching. Transformers: 4.57.3.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/265 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

added_tokens.json:   0%|          | 0.00/632 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/613 [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

✅ Base model loaded: Qwen/Qwen2.5-Coder-1.5B-Instruct

2. Loading FINE-TUNED model...


model.safetensors:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/270 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/73.9M [00:00<?, ?B/s]

Unsloth 2025.12.6 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


✅ Fine-tuned model loaded: andrea-t94/qwen2.5-1.5b-airflow-instruct

✅ All models loaded.


## 5. Inference Configuration

In [6]:
# Chat template and generation settings used for inference
from unsloth.chat_templates import get_chat_template

# Apply chat template to tokenizer
tokenizer = get_chat_template(
    tokenizer,
    chat_template="qwen-2.5",
)

# Generation parameters optimized for Unsloth
GENERATION_CONFIG = {
    "max_new_tokens": 4096,  # Upper bound on generated tokens per example
    "temperature": 0.1,
    "top_p": 0.9,
    "do_sample": True,
    "use_cache": True,  # Enable KV cache for speed
}

print("Inference configuration (Unsloth optimized):")
for key, value in GENERATION_CONFIG.items():
    print(f"  {key}: {value}")

Inference configuration (Unsloth optimized):
  max_new_tokens: 4096
  temperature: 0.1
  top_p: 0.9
  do_sample: True
  use_cache: True
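
A temperature of 0.1 makes sampling close to greedy even though `do_sample` is enabled: dividing the logits by a small temperature sharpens the softmax, so almost all probability mass lands on the top token. The sketch below illustrates this on made-up logits; it is not how the model computes its own distribution, just the standard formula in plain Python.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max before exponentiating
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens

p_default = softmax_with_temperature(toy_logits, 1.0)
p_sharp = softmax_with_temperature(toy_logits, 0.1)

print("temperature 1.0:", [round(p, 4) for p in p_default])
print("temperature 0.1:", [round(p, 4) for p in p_sharp])
```

In this toy case the top token alone already exceeds the `top_p` threshold of 0.9 at temperature 0.1, so nucleus sampling would keep only that token.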


## 6. Helper Functions

In [7]:
def extract_code_from_response(response_text):
    """Extract Python code from model response."""
    # Try to find code block
    if "```python" in response_text:
        start_idx = response_text.find("```python") + len("```python")
        end_idx = response_text.find("```", start_idx)
        if end_idx != -1:
            return response_text[start_idx:end_idx].strip()

    # Try generic code block
    if "```" in response_text:
        parts = response_text.split("```")
        if len(parts) >= 3:
            return parts[1].strip()

    return response_text.strip()
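
A quick way to check extraction logic like this is to run it on a few hand-written responses. The sketch below is a regex-based variant written for illustration (it is not the notebook's helper): it prefers a fence tagged `python`, falls back to any fenced block, and otherwise returns the stripped text. The fence marker is built programmatically so the example contains no literal triple backticks.

```python
import re

FENCE = "`" * 3  # three backticks, built at runtime


def extract_code(response_text: str) -> str:
    """Return the first fenced code block, preferring one tagged as python."""
    fence = re.escape(FENCE)
    patterns = (
        fence + r"python\s*\n(.*?)" + fence,  # fence tagged as python
        fence + r"[^\n]*\n(.*?)" + fence,     # any other fenced block
    )
    for pattern in patterns:
        match = re.search(pattern, response_text, flags=re.DOTALL)
        if match:
            return match.group(1).strip()
    return response_text.strip()


# Hand-written sample responses
tagged = f"Here is the DAG:\n{FENCE}python\nprint('dag')\n{FENCE}\nDone."
untagged = f"{FENCE}\nx = 1\n{FENCE}"
plain = "  no fences here  "

for sample in (tagged, untagged, plain):
    print(repr(extract_code(sample)))
```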

## 7. Run Inference on Test Dataset

This will process models **separately** to maximize speed and avoid memory pressure.
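
The loop in the next cell steps through the test set in strides of `BATCH_SIZE`. With the 412 test examples loaded earlier and a batch size of 32, that gives 13 batches, the last one holding 28 examples. A minimal sketch of the index arithmetic, using plain integers in place of the dataset (`batch_bounds` is a hypothetical helper):

```python
import math


def batch_bounds(n_examples: int, batch_size: int):
    """Yield (start, end) index pairs that cover range(n_examples) in order."""
    for start in range(0, n_examples, batch_size):
        yield start, min(start + batch_size, n_examples)


bounds = list(batch_bounds(412, 32))

print(len(bounds))                     # 13
print(bounds[-1])                      # (384, 412)
print(bounds[-1][1] - bounds[-1][0])   # 28 examples in the final batch
```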

In [8]:
import time
from tqdm.auto import tqdm
import gc
import torch
from unsloth import FastLanguageModel

# ---------------------------------------------------------
# CONFIGURATION
# ---------------------------------------------------------
MAX_EXAMPLES = None  # Set to None for full run
# Increase Batch Size for parallel processing
# (For a 1.5B model on T4 GPU, you can likely push this to 16 or 32)
BATCH_SIZE = 32

# Create the test subset
test_examples = dataset.select(range(min(MAX_EXAMPLES, len(dataset)))) if MAX_EXAMPLES else dataset

# Helper to slice dataset cleanly
def get_batch_as_list(dataset, start_idx, batch_size):
    end_idx = min(start_idx + batch_size, len(dataset))
    return dataset.select(range(start_idx, end_idx)).to_list()

# Helper to extract code (this redefinition replaces the version from Section 6)
def extract_code_from_response(text):
    if "```python" in text:
        return text.split("```python")[1].split("```")[0].strip()
    if "```" in text:
        return text.split("```")[1].strip()
    return text.strip()

# ---------------------------------------------------------
# PARALLEL GENERATION FUNCTION (The Speed Upgrade)
# ---------------------------------------------------------
def generate_parallel(model, tokenizer, examples, model_name="model"):
    results = []

    # 1. Prepare Prompts
    prompts = []
    for example in examples:
        messages = example['messages']
        # Apply template to inputs only
        prompt = tokenizer.apply_chat_template(
            messages[:-1],
            tokenize=False,
            add_generation_prompt=True
        )
        prompts.append(prompt)

    # 2. Batch Tokenize (Left Padding is crucial here!)
    # We do this inside the function to ensure the tokenizer config is respected
    inputs = tokenizer(
        prompts,
        return_tensors="pt",
        padding=True,           # Pad to the longest sequence in this batch
        truncation=True,
        max_length=4096,
    ).to("cuda")

    # 3. Generate (processes every row in the batch simultaneously)
    outputs = model.generate(
        **inputs,
        **GENERATION_CONFIG,
        pad_token_id=tokenizer.eos_token_id,
    )

    # 4. Batch Decode
    # Slice off the input prompt tokens from the output
    generated_ids = outputs[:, inputs['input_ids'].shape[1]:]
    decoded_texts = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

    # 5. Reassemble Results
    for i, text in enumerate(decoded_texts):
        original_msg = examples[i]['messages']
        results.append({
            'messages': original_msg[:-1] + [{'role': 'assistant', 'content': extract_code_from_response(text)}],
            'metadata': {
                **(examples[i].get('metadata') or {}),
                'model': model_name,
                'inference_time': time.time()  # Unix timestamp taken after generation, not a duration
            }
        })

    return results

# ==========================================
# MAIN EXECUTION
# ==========================================

print(f"Strategy: True Parallel Batching (Batch Size: {BATCH_SIZE})")

# --- STEP 1: BASE MODEL ---
print("\n🤖 STEP 1/2: Loading & Running BASE model...")
base_model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=BASE_MODEL_ID,
    max_seq_length=4096,
    dtype=None,
    load_in_4bit=False,
)
FastLanguageModel.for_inference(base_model)

# CRITICAL: Configure Tokenizer for Batch Generation
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token

base_results = []
start_time = time.time()

for i in tqdm(range(0, len(test_examples), BATCH_SIZE), desc="Base Model"):
    batch_data = get_batch_as_list(test_examples, i, BATCH_SIZE)
    batch_results = generate_parallel(base_model, tokenizer, batch_data, "Qwen2.5-Base")
    base_results.extend(batch_results)

# Cleanup: collect Python garbage first so the freed blocks can be released from the CUDA cache
del base_model
gc.collect()
torch.cuda.empty_cache()

# --- STEP 2: FINETUNED MODEL ---
print("\n✨ STEP 2/2: Loading & Running FINETUNED model...")
finetuned_model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=FINETUNED_ADAPTER_ID,
    max_seq_length=4096,
    dtype=None,
    load_in_4bit=False,
)
FastLanguageModel.for_inference(finetuned_model)

# CRITICAL: Re-configure Tokenizer (Loading a new model might reset it)
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token

finetuned_results = []

for i in tqdm(range(0, len(test_examples), BATCH_SIZE), desc="Finetuned Model"):
    batch_data = get_batch_as_list(test_examples, i, BATCH_SIZE)
    batch_results = generate_parallel(finetuned_model, tokenizer, batch_data, "Qwen2.5-Finetuned")
    finetuned_results.extend(batch_results)

print("\n" + "="*60)
print("✅ Inference Complete.")
print(f"Total Base Results: {len(base_results)}")
print(f"Total Finetuned Results: {len(finetuned_results)}")
print("="*60)

Strategy: True Parallel Batching (Batch Size: 32)

🤖 STEP 1/2: Loading & Running BASE model...


Base Model:   0%|          | 0/13 [00:00<?, ?it/s]


✨ STEP 2/2: Loading & Running FINETUNED model...


Finetuned Model:   0%|          | 0/13 [00:00<?, ?it/s]


✅ Inference Complete.
Total Base Results: 412
Total Finetuned Results: 412
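
The decode step in the generation function slices every row at the same column, `inputs['input_ids'].shape[1]`. That works because the prompts are left-padded: every prompt then ends at the same column, and the new tokens start right after it. The toy sketch below mimics this with plain lists of made-up token IDs; `PAD` and `fake_generate` are hypothetical stand-ins, not tokenizer or model calls.

```python
PAD = 0  # hypothetical pad token ID


def left_pad(sequences, pad_id=PAD):
    """Pad every sequence on the left to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    return [[pad_id] * (width - len(seq)) + seq for seq in sequences]


def fake_generate(padded_rows, new_tokens):
    """Stand-in for model.generate: append the given new tokens to each row."""
    return [row + new for row, new in zip(padded_rows, new_tokens)]


prompts = [[11, 12, 13], [21]]   # two prompts of different lengths
padded = left_pad(prompts)       # [[11, 12, 13], [0, 0, 21]]
outputs = fake_generate(padded, [[101, 102], [201, 202]])

# Every prompt ends at the same column, so one slice removes all prompts
prompt_width = len(padded[0])
generated = [row[prompt_width:] for row in outputs]

print(padded)
print(generated)  # [[101, 102], [201, 202]]
```

With right padding, the pad tokens would sit between the shorter prompt and its continuation, and the same column slice would no longer isolate the generated tokens cleanly.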


In [9]:
import json
import time
from datetime import datetime
from google.colab import files

# 1. Generate unique filenames with a timestamp
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
base_filename = f"base_model_outputs_{timestamp}.jsonl"
finetuned_filename = f"finetuned_model_outputs_{timestamp}.jsonl"

print("📝 Saving results to local Colab storage...")

# 2. Save Base Model Results
with open(base_filename, 'w') as f:
    for entry in base_results:
        # We save the full entry (messages + metadata) to ensure we have context later
        f.write(json.dumps(entry) + '\n')
print(f"   ✅ Saved: {base_filename} ({len(base_results)} records)")

# 3. Save Fine-Tuned Model Results
with open(finetuned_filename, 'w') as f:
    for entry in finetuned_results:
        f.write(json.dumps(entry) + '\n')
print(f"   ✅ Saved: {finetuned_filename} ({len(finetuned_results)} records)")

# 4. Download files to your laptop
# Note: If running inside VS Code, this might not pop up a window.
# If that happens, use the file explorer to right-click -> Download.
print("\n⬇️ Initiating downloads...")

files.download(base_filename)

# Small sleep ensures the browser handles the first download before starting the second
time.sleep(2)

files.download(finetuned_filename)

print("✅ Download commands sent.")

📝 Saving results to local Colab storage...
   ✅ Saved: base_model_outputs_20251217_151724.jsonl (412 records)
   ✅ Saved: finetuned_model_outputs_20251217_151724.jsonl (412 records)

⬇️ Initiating downloads...



✅ Download commands sent.
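
To compare the two runs later, the JSONL files can be read back line by line. Both loops walk the test set in the same order, so records can be paired by position. A minimal sketch, using a temporary file with made-up records in place of the real outputs; `load_jsonl` is a hypothetical helper.

```python
import json
import os
import tempfile


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up records that mirror the saved structure (messages + metadata)
toy_results = [
    {"messages": [{"role": "user", "content": "Create a DAG."},
                  {"role": "assistant", "content": "print('a')"}],
     "metadata": {"model": "toy-model"}},
    {"messages": [{"role": "user", "content": "Create another DAG."},
                  {"role": "assistant", "content": "print('b')"}],
     "metadata": {"model": "toy-model"}},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_outputs.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for entry in toy_results:
            f.write(json.dumps(entry) + "\n")
    loaded = load_jsonl(path)

# The generated code is the content of the last message in each record
answers = [record["messages"][-1]["content"] for record in loaded]
print(answers)
```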


## 8. Cleanup (Optional)

Free up GPU memory if needed.

In [None]:
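import gc

# Delete models to free memory. base_model was already deleted in step 7,
# so remove each name only if it still exists (a plain `del` would raise NameError).
for name in ("base_model", "finetuned_model"):
    globals().pop(name, None)
gc.collect()
torch.cuda.empty_cache()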

print("✅ GPU memory cleared")
!nvidia-smi