# 🎯 T5 Fine-Tuning — AU-Ggregates Text-to-SQL (Kaggle)

Fine-tunes `gaussalgo/T5-LM-Large-text2sql-spider` with a LoRA adapter on your custom pairs merged with a sample of the Spider training set.

## Before you start
1. Go to **Settings** (right panel) → **Accelerator** → select **GPU T4 x2** or **P100** (Cell 1 pins training to a single GPU either way)
2. Click **Add Data** (right panel) → upload your `t5_text2sql_5000_pairs.jsonl`
3. Your uploaded file will be at `/kaggle/input/YOUR_DATASET_NAME/t5_text2sql_5000_pairs.jsonl`
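
Each line of the uploaded file is expected to be one JSON object with `input` and `target` keys, which is what Cell 4 checks. A minimal sketch of building and checking one such record follows. The question and SQL are invented for illustration, the schema prefix is copied from Cell 4, and `is_well_formed` is a hypothetical helper that covers only a subset of Cell 4's checks.

```python
import json

# Schema prefix copied from Cell 4 of this notebook.
SCHEMA_PREFIX = (
    'tables: ai_documents (id, source_table, file_name, project_name, '
    'searchable_text, metadata, document_type) | query:'
)

# Hypothetical example pair; real questions and SQL come from your own data.
example_pair = {
    'input': f'{SCHEMA_PREFIX} show all expense files',
    'target': "SELECT file_name FROM ai_documents "
              "WHERE source_table = 'Expenses' AND document_type = 'file';",
}

def is_well_formed(line):
    """Return True if a JSONL line passes a subset of the Cell 4 checks."""
    try:
        pair = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(pair, dict) or 'input' not in pair or 'target' not in pair:
        return False
    return (
        pair['input'].startswith(SCHEMA_PREFIX)
        and pair['target'].strip().upper().startswith('SELECT')
    )

# One record per line, exactly as it would appear in the .jsonl file.
jsonl_line = json.dumps(example_pair, ensure_ascii=False)
print(jsonl_line)
print('well formed:', is_well_formed(jsonl_line))
```

Cell 4 additionally requires a `source_table` filter and a `document_type` filter in every target, so a record that passes this sketch can still be flagged there.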

## Pipeline
| Step | Cell | What it does | Time |
|------|------|-------------|------|
| 1 | Install | Install dependencies | ~2 min |
| 2 | GPU Check | Verify GPU | instant |
| 3 | Load Data | Auto-find uploaded JSONL file | instant |
| 4 | Validate | Validate all 5,000 pairs | ~10 sec |
| 5 | Clean + Merge | Clean custom pairs + merge Spider data | ~3-5 min |
| 6 | Train | Fine-tune T5 (10 epochs) | ~60-120 min |
| 7 | Evaluate | Test accuracy on validation set | ~5 min |
| 8 | Test | Interactive SQL generation test | instant |
| 9 | Export | Zip model for download | ~2 min |

In [None]:
# ============================================================
# CELL 1: Install dependencies
# ============================================================
# CRITICAL: set this BEFORE any torch import so only one GPU is visible
# (prevents the NaN losses seen with multi-GPU DataParallel)
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

!pip install -q transformers accelerate sentencepiece
!pip install -q datasets evaluate sqlparse
!pip install -q huggingface_hub peft

print('\nAll dependencies installed!')

In [None]:
# ============================================================
# CELL 2: Verify GPU
# ============================================================
import torch

print(f'PyTorch: {torch.__version__}')
print(f'CUDA:    {torch.cuda.is_available()}')

if torch.cuda.is_available():
    n_gpus = torch.cuda.device_count()
    print(f'GPUs:    {n_gpus}')
    for i in range(n_gpus):
        name = torch.cuda.get_device_name(i)
        props = torch.cuda.get_device_properties(i)
        vram = (getattr(props, 'total_memory', None) or getattr(props, 'total_mem', 0)) / 1024**3
        print(f'  GPU {i}: {name} ({vram:.1f} GB VRAM)')
    print('\n✅ GPU ready!')
else:
    print('❌ No GPU! Go to Settings → Accelerator → GPU T4 x2')

In [None]:
# ============================================================
# CELL 3: Load training data (auto-detect from /kaggle/input/)
# ============================================================
import glob
import os

# Auto-find the JSONL file in /kaggle/input/
matches = glob.glob('/kaggle/input/**/*.jsonl', recursive=True)

if matches:
    TRAINING_FILE = matches[0]
    print(f'📁 Found: {TRAINING_FILE}')
    print(f'   Size: {os.path.getsize(TRAINING_FILE) / 1024 / 1024:.1f} MB')
else:
    print('❌ No JSONL file found in /kaggle/input/')
    print('   Click "Add Data" in the right panel and upload your .jsonl file')
    print()
    print('   Available files in /kaggle/input/:')
    for f in glob.glob('/kaggle/input/**/*', recursive=True):
        print(f'     {f}')
    TRAINING_FILE = None

print(f'\n✅ TRAINING_FILE = {TRAINING_FILE}')

In [None]:
# ============================================================
# CELL 4: Validate training data
# ============================================================
import json
import re
from collections import Counter

SCHEMA_PREFIX = 'tables: ai_documents (id, source_table, file_name, project_name, searchable_text, metadata, document_type) | query:'
VALID_SOURCE_TABLES = {'Expenses', 'CashFlow', 'Project', 'Quotation', 'QuotationItem'}
NUMERIC_KEYS = {'Expenses', 'Amount', 'total_amount', 'volume', 'line_total'}

pairs = []
errors = []
source_table_counts = Counter()
intent_counts = Counter()

with open(TRAINING_FILE, 'r', encoding='utf-8') as f:
    for i, line in enumerate(f, 1):
        line = line.strip()
        if not line:
            continue
        try:
            pair = json.loads(line)
        except json.JSONDecodeError as e:
            errors.append(f'Line {i}: Invalid JSON - {e}')
            continue

        inp = pair.get('input', '')
        tgt = pair.get('target', '')

        if 'input' not in pair or 'target' not in pair:
            errors.append(f'Line {i}: Missing input or target key')
            continue

        if not inp.startswith(SCHEMA_PREFIX):
            errors.append(f'Line {i}: Missing Spider schema prefix')

        tgt_upper = tgt.strip().upper()
        if not tgt_upper.startswith('SELECT'):
            errors.append(f'Line {i}: Target is not a SELECT statement')

        has_source = any(f"source_table = '{t}'" in tgt for t in VALID_SOURCE_TABLES)
        if not has_source:
            errors.append(f'Line {i}: Missing source_table filter')

        if "document_type = 'file'" not in tgt and "document_type = 'row'" not in tgt:
            errors.append(f'Line {i}: Missing document_type filter')

        for t in VALID_SOURCE_TABLES:
            if f"source_table = '{t}'" in tgt:
                source_table_counts[t] += 1

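        # Check GROUP BY first: an aggregate with GROUP BY is a comparison and
        # would otherwise be swallowed by the SUM(/COUNT( branches below.
        if 'GROUP BY' in tgt_upper and ('SUM' in tgt_upper or 'COUNT' in tgt_upper):
            intent_counts['compare'] += 1
        elif 'SUM(' in tgt_upper:
            intent_counts['sum'] += 1
        elif 'AVG(' in tgt_upper:
            intent_counts['average'] += 1
        elif 'COUNT(' in tgt_upper:
            intent_counts['count'] += 1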
        elif 'DISTINCT' in tgt_upper:
            intent_counts['list_categories'] += 1
        elif "document_type = 'file'" in tgt:
            intent_counts['list_files'] += 1
        else:
            intent_counts['query_data'] += 1

        pairs.append(pair)

print('=' * 60)
print(f'📊 VALIDATION REPORT')
print('=' * 60)
print(f'Total pairs:  {len(pairs)}')
print(f'Errors:       {len(errors)}')
print()
print('Source table distribution:')
for t in sorted(source_table_counts, key=source_table_counts.get, reverse=True):
    pct = source_table_counts[t] / len(pairs) * 100
    print(f'  {t:20s} {source_table_counts[t]:5d} ({pct:.1f}%)')
print()
print('Intent distribution:')
for intent in sorted(intent_counts, key=intent_counts.get, reverse=True):
    pct = intent_counts[intent] / len(pairs) * 100
    print(f'  {intent:20s} {intent_counts[intent]:5d} ({pct:.1f}%)')
print()

if errors:
    print(f'⚠️  First 10 errors:')
    for e in errors[:10]:
        print(f'  {e}')
    print()

if len(pairs) >= 4500 and len(errors) < len(pairs) * 0.05:
    print(f'✅ Data looks good! {len(pairs)} valid pairs ready.')
elif len(pairs) > 0:
    print(f'⚠️  {len(pairs)} valid pairs. Error rate: {len(errors)/max(len(pairs),1)*100:.1f}%')
else:
    print('❌ No valid pairs found! Check your JSONL file.')

In [None]:
# ============================================================
# CELL 5: Clean custom pairs + Merge with Spider dataset
# ============================================================
import json, re, hashlib, random
from collections import Counter
from datasets import load_dataset

SPIDER_LIMIT = 3000

def normalize_sql(sql):
    sql = sql.strip().rstrip(';').strip()
    # Collapse runs of whitespace (r'\s+', not r'\\s+', which matches a literal backslash)
    sql = re.sub(r'\s+', ' ', sql)
    return sql.lower()

def sql_fingerprint(pair):
    return hashlib.md5(normalize_sql(pair.get('target', '')).encode()).hexdigest()

TYPO_FIXES = {
    "source_table = 'expenses'": "source_table = 'Expenses'",
    "source_table = 'cashflow'": "source_table = 'CashFlow'",
    "source_table = 'Cashflow'": "source_table = 'CashFlow'",
    "source_table = 'cash_flow'": "source_table = 'CashFlow'",
    "source_table = 'project'": "source_table = 'Project'",
    "source_table = 'quotation'": "source_table = 'Quotation'",
    "source_table = 'quotationitem'": "source_table = 'QuotationItem'",
    "source_table = 'QuotationItems'": "source_table = 'QuotationItem'",
    "source_table = 'Quotation_Item'": "source_table = 'QuotationItem'",
}

def clean_pair(pair):
    tgt = pair['target'].strip()
    if not tgt.endswith(';'):
        tgt += ';'
    tgt = tgt.replace(';;', ';')
    for wrong, right in TYPO_FIXES.items():
        tgt = tgt.replace(wrong, right)
    # Rewrite metadata->'key' (JSON value) to metadata->>'key' (text value)
    tgt = re.sub(r"metadata->'(\w+)'", r"metadata->>'\1'", tgt)
    pair['target'] = tgt
    return pair

def validate_custom(pair):
    inp = pair.get('input', '')
    tgt = pair.get('target', '')
    if not inp.startswith(SCHEMA_PREFIX):
        return False
    if not tgt.strip().upper().startswith('SELECT'):
        return False
    has_src = any(f"source_table = '{t}'" in tgt for t in VALID_SOURCE_TABLES)
    if not has_src:
        return False
    if "document_type = 'file'" not in tgt and "document_type = 'row'" not in tgt:
        return False
    return True

# --- Step 1: Clean custom pairs ---
print('🧹 Step 1: Cleaning custom pairs...')
cleaned = [clean_pair(dict(p)) for p in pairs]
valid_custom = [p for p in cleaned if validate_custom(p)]
n_invalid = len(cleaned) - len(valid_custom)
print(f'   Valid: {len(valid_custom)}, Invalid: {n_invalid}')

# --- Step 2: Download Spider dataset ---
print(f'\n🕷️ Step 2: Downloading Spider dataset (limit={SPIDER_LIMIT})...')
spider_ds = load_dataset('spider', split='train')
spider_pairs = []
spider_seen = set()

for ex in spider_ds:
    q = ex.get('question', '')
    sql = ex.get('query', '')
    db = ex.get('db_id', '')
    if not q or not sql:
        continue
    if not sql.strip().upper().startswith('SELECT'):
        continue
    tgt = sql.strip()
    if not tgt.endswith(';'):
        tgt += ';'
    sfp = hashlib.md5(normalize_sql(tgt).encode()).hexdigest()
    if sfp in spider_seen:
        continue
    spider_seen.add(sfp)
    spider_pairs.append({
        'input': f'tables: {db} | query: {q}',
        'target': tgt
    })
    if len(spider_pairs) >= SPIDER_LIMIT:
        break

print(f'   Spider pairs loaded: {len(spider_pairs)}')

# --- Step 3: Merge + Deduplicate ---
print('\n🔀 Step 3: Merging + deduplicating...')
all_merged = valid_custom + spider_pairs
seen_fps = set()
deduped = []
n_dupes = 0

for p in all_merged:
    fp = sql_fingerprint(p)
    if fp not in seen_fps:
        seen_fps.add(fp)
        deduped.append(p)
    else:
        n_dupes += 1

random.seed(42)
random.shuffle(deduped)

# --- Step 4: Save merged file ---
MERGED_FILE = '/kaggle/working/training_final.jsonl'
with open(MERGED_FILE, 'w', encoding='utf-8') as f:
    for p in deduped:
        f.write(json.dumps(p, ensure_ascii=False) + '\n')

TRAINING_FILE = MERGED_FILE

print()
print('=' * 60)
print('  CLEAN + MERGE REPORT')
print('=' * 60)
print(f'  Custom pairs (valid):    {len(valid_custom)}')
print(f'  Custom pairs (invalid):  {n_invalid}')
print(f'  Spider pairs:            {len(spider_pairs)}')
print(f'  Duplicates removed:      {n_dupes}')
print(f'  Total merged:            {len(deduped)}')
print(f'  Output file:             {MERGED_FILE}')
print('=' * 60)
print()
if len(deduped) >= 5000:
    print(f'✅ {len(deduped)} pairs ready for training!')
else:
    print(f'⚠️  Only {len(deduped)} pairs. Consider adding more custom data.')
print(f'\n📌 TRAINING_FILE updated to: {TRAINING_FILE}')
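
The deduplication in Step 3 keys each pair on a hash of its normalized SQL target. The sketch below isolates that idea with a simplified `sql_fingerprint` that takes the SQL string directly (Cell 5 passes the whole pair). The three targets are hypothetical.

```python
import hashlib
import re

def normalize_sql(sql):
    """Lowercase, drop the trailing semicolon, and collapse whitespace runs."""
    sql = sql.strip().rstrip(';').strip()
    sql = re.sub(r'\s+', ' ', sql)
    return sql.lower()

def sql_fingerprint(sql):
    """MD5 of the normalized SQL, used only as a dedup key (not for security)."""
    return hashlib.md5(normalize_sql(sql).encode()).hexdigest()

# Hypothetical targets: the first two differ only in case, spacing and ';'.
targets = [
    "SELECT file_name FROM ai_documents WHERE document_type = 'file';",
    "select  file_name\nFROM ai_documents   WHERE document_type = 'file'",
    "SELECT COUNT(*) FROM ai_documents WHERE document_type = 'row';",
]

seen, unique = set(), []
for sql in targets:
    fp = sql_fingerprint(sql)
    if fp not in seen:  # keep only the first occurrence of each fingerprint
        seen.add(fp)
        unique.append(sql)

print(f'{len(targets)} targets -> {len(unique)} unique')
```

Two consequences of this design are worth keeping in mind. Lowercasing also folds string literals, so `'Expenses'` and `'expenses'` produce the same fingerprint. And because only the target is hashed, differently worded questions that map to the same SQL are reduced to a single pair.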

In [None]:
# ============================================================
# CELL 6: Fine-tune T5 with LoRA (PEFT) -- bulletproof version
# ============================================================
# LoRA = Low-Rank Adaptation. Freezes the base model and trains only small
# adapter matrices (well under 1% of the weights; the exact count is printed
# below). After training, the adapter is merged back into a normal T5 model.
# ============================================================
import json, time, gc, torch
from pathlib import Path
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM, AutoTokenizer,
    DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments,
)
from peft import LoraConfig, get_peft_model, TaskType

# --- Verify single GPU ---
assert torch.cuda.device_count() == 1, (
    f'Expected 1 GPU but found {torch.cuda.device_count()}! '
    "Restart the kernel and make sure Cell 1 sets os.environ['CUDA_VISIBLE_DEVICES'] = '0'"
)
print(f'GPU: {torch.cuda.get_device_name(0)} (single GPU mode)')

# --- Config ---
MODEL_NAME = 'gaussalgo/T5-LM-Large-text2sql-spider'
OUTPUT_DIR = '/kaggle/working/t5-finetuned'
EPOCHS = 10
BATCH_SIZE = 2
LEARNING_RATE = 1e-4     # more conservative than 3e-4 (training below runs in fp32)
MAX_INPUT_LEN = 512
MAX_TARGET_LEN = 256
LORA_R = 16
LORA_ALPHA = 32
LORA_DROPOUT = 0.05

# --- Load data ---
print('Loading training data...')
inputs, targets = [], []
with open(TRAINING_FILE, 'r', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        try:
            pair = json.loads(line)
            if 'input' in pair and 'target' in pair:
                inputs.append(pair['input'])
                targets.append(pair['target'])
        except json.JSONDecodeError:
            continue

dataset = Dataset.from_dict({'input': inputs, 'target': targets})
split = dataset.train_test_split(test_size=0.1, seed=42)
train_ds, val_ds = split['train'], split['test']
print(f'   Train: {len(train_ds)}, Validation: {len(val_ds)}')

# --- Load model in float32 first, then LoRA ---
print(f'Loading {MODEL_NAME}...')
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
base_model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
print(f'   Base parameters: {base_model.num_parameters():,}')

base_model.gradient_checkpointing_enable()
print('   Gradient checkpointing: ON')

print('Applying LoRA adapter...')
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=LORA_R, lora_alpha=LORA_ALPHA, lora_dropout=LORA_DROPOUT,
    target_modules=['q', 'v'],
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()

# --- Tokenize ---
# NOTE: Do NOT use padding='max_length' here -- let DataCollatorForSeq2Seq handle padding.
# Only truncate. The collator will pad dynamically per batch and apply -100 masking.
def tokenize(examples):
    model_inputs = tokenizer(
        examples['input'], max_length=MAX_INPUT_LEN,
        truncation=True
    )
    labels = tokenizer(
        text_target=examples['target'], max_length=MAX_TARGET_LEN,
        truncation=True
    )
    model_inputs['labels'] = labels['input_ids']
    return model_inputs

print('Tokenizing...')
tok_train = train_ds.map(tokenize, batched=True, remove_columns=['input', 'target'])
tok_val = val_ds.map(tokenize, batched=True, remove_columns=['input', 'target'])

# --- DIAGNOSTIC: Verify labels are NOT empty ---
print('\n🔍 Label diagnostic (first 3 examples):')
for i in range(min(3, len(tok_train))):
    lbl = tok_train[i]['labels']
    print(f'  Example {i}: {len(lbl)} label tokens, first 5: {lbl[:5]}')
    if len(lbl) == 0:
        raise ValueError(f'Example {i} has EMPTY labels! Check training data.')
print()

# --- Training args ---
training_args = Seq2SeqTrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=LEARNING_RATE,
    eval_strategy='epoch',
    save_strategy='epoch',
    logging_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model='eval_loss',
    greater_is_better=False,
    save_total_limit=2,
    fp16=False,
    bf16=False,
    report_to='none',
    dataloader_num_workers=0,
    warmup_steps=200,
    weight_decay=0.01,
    max_grad_norm=1.0,
    gradient_accumulation_steps=4,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tok_train,
    eval_dataset=tok_val,
    processing_class=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, padding=True),
)

# --- Train ---
gc.collect(); torch.cuda.empty_cache()
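# Optimizer steps per epoch = examples / (batch size * gradient accumulation)
steps_per_epoch = len(tok_train) // (BATCH_SIZE * training_args.gradient_accumulation_steps)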
print(f'\nStarting LoRA training: {EPOCHS} epochs, batch={BATCH_SIZE}, lr={LEARNING_RATE}')
print(f'   Steps per epoch: ~{steps_per_epoch}')
print(f'   Estimated time: ~{EPOCHS * steps_per_epoch * 0.5 / 60:.0f} minutes on single T4')
print(f'   Single GPU: {torch.cuda.get_device_name(0)}\n')

start = time.time()
trainer.train()
elapsed = (time.time() - start) / 60

# --- Merge LoRA adapter back into base model ---
print('\nMerging LoRA adapter into base model...')
merged_model = model.merge_and_unload()

# --- Save merged model ---
save_path = f'{OUTPUT_DIR}/final'
merged_model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

model = merged_model

print(f'\nTraining complete in {elapsed:.1f} minutes!')
print(f'   Model saved to: {save_path}')
print(f'   LoRA adapter merged -- this is a normal T5 model now.')

# --- AUTO-PUSH TO HUGGINGFACE (runs automatically after training) ---
print('\n' + '='*60)
print('PUSHING TO HUGGINGFACE HUB...')
print('='*60)
try:
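    import os
    from huggingface_hub import login, HfApi
    # Never hard-code an access token in a notebook. Read it from the environment
    # (for example a Kaggle secret exported as HF_TOKEN) and revoke any token
    # that has already been committed or shared.
    HF_TOKEN = os.environ.get('HF_TOKEN')
    if not HF_TOKEN:
        raise RuntimeError('HF_TOKEN is not set')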
    HF_REPO = 't5-auggregates-text2sql'
    login(token=HF_TOKEN)
    api = HfApi()
    whoami = api.whoami()
    hf_username = whoami['name']
    full_repo = f'{hf_username}/{HF_REPO}'
    print(f'   Logged in as: {hf_username}')
    print(f'   Repo: {full_repo}')
    api.create_repo(HF_REPO, exist_ok=True, private=False)
    print('   Uploading model files...')
    merged_model.push_to_hub(full_repo)
    tokenizer.push_to_hub(full_repo)
    print(f'\n✅ Model pushed! https://huggingface.co/{full_repo}')
    print(f'   Set in .env: T5_MODEL_PATH={full_repo}')
except Exception as e:
    print(f'\n❌ HuggingFace push failed: {e}')
    print('   Model is still saved locally at:', save_path)
    print('   Run Cell 9B manually to retry the push.')
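
The trainable-parameter figure printed by `model.print_trainable_parameters()` can be sanity-checked by hand. The sketch below assumes the T5 v1.1 large layout (24 encoder and 24 decoder blocks, with 1024 × 1024 query and value projections). These dimensions are assumptions about the checkpoint, not values read from it, so treat the number PEFT prints as authoritative.

```python
def lora_param_count(r, d_in, d_out, n_matrices):
    """Each adapted matrix gains A (r x d_in) and B (d_out x r)."""
    return n_matrices * r * (d_in + d_out)

# Assumed architecture (T5 v1.1 large); verify against model.config.
ENC_LAYERS, DEC_LAYERS, D_MODEL = 24, 24, 1024

# Encoder blocks have self-attention only; decoder blocks have
# self-attention plus cross-attention.
attention_blocks = ENC_LAYERS + 2 * DEC_LAYERS

# target_modules=['q', 'v'] adapts two projections per attention block.
n_matrices = attention_blocks * 2

for r in (8, 16):
    n = lora_param_count(r, D_MODEL, D_MODEL, n_matrices)
    print(f'r={r}: {n:,} trainable LoRA parameters')
```

Under these assumptions the count scales linearly with `LORA_R`, so doubling the rank doubles the adapter size.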

In [None]:
# ============================================================
# CELL 7: Evaluate on validation set
# ============================================================
import sqlparse
import time

print('📊 Evaluating on validation set...\n')

model.eval()
device = model.device

exact_matches = 0
valid_sql = 0
total_time = 0.0
samples = []

for idx in range(len(val_ds)):
    inp = val_ds[idx]['input']
    expected = val_ds[idx]['target']

    encoded = tokenizer(inp, return_tensors='pt', max_length=MAX_INPUT_LEN, truncation=True, padding=True)
    input_ids = encoded.input_ids.to(device)
    attention_mask = encoded.attention_mask.to(device)

    t0 = time.time()
    outputs = model.generate(input_ids=input_ids, attention_mask=attention_mask, max_length=MAX_TARGET_LEN)
    total_time += (time.time() - t0) * 1000

    generated = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

    if generated == expected:
        exact_matches += 1

    try:
        parsed = sqlparse.parse(generated)
        if parsed and parsed[0].get_type() and parsed[0].get_type().upper() == 'SELECT':
            valid_sql += 1
    except Exception:
        pass

    if len(samples) < 15:
        match = '✅' if generated == expected else '❌'
        samples.append((match, inp.split('query: ')[-1], expected, generated))

    if (idx + 1) % 100 == 0:
        print(f'  Evaluated {idx + 1}/{len(val_ds)}...')

total = len(val_ds)
em_acc = exact_matches / total * 100
exec_acc = valid_sql / total * 100
avg_ms = total_time / total

print()
print('=' * 60)
print('  EVALUATION RESULTS')
print('=' * 60)
print(f'  Validation examples:   {total}')
print(f'  Exact-match accuracy:  {em_acc:.1f}%')
print(f'  Valid SQL rate:        {exec_acc:.1f}%')
print(f'  Avg inference time:    {avg_ms:.1f} ms')
print('=' * 60)
print()

if em_acc >= 70:
    print('🎉 Great accuracy! Model is ready for deployment.')
elif em_acc >= 50:
    print('👍 Decent accuracy. Consider more training data or epochs.')
else:
    print('⚠️  Low accuracy. Check training data quality or try more epochs.')

print()
print('Sample predictions:')
print('-' * 60)
for match, question, expected, generated in samples[:10]:
    print(f'{match} Q: {question}')
    print(f'   Expected:  {expected[:100]}')
    print(f'   Generated: {generated[:100]}')
    print()
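
Cell 7 counts a prediction as correct only when the generated string equals the target exactly, so a missing semicolon or an extra space counts as a miss. A more forgiving comparison could normalize both sides first. The helper names below are hypothetical, and this is a sketch of an alternative, not the metric the cell reports.

```python
import re

def normalize_for_match(sql):
    """Collapse whitespace and drop a trailing semicolon; keep case intact."""
    sql = sql.strip().rstrip(';').strip()
    return re.sub(r'\s+', ' ', sql)

def exact_match_rate(predictions, references, normalize=False):
    """Percentage of predictions equal to their reference."""
    if len(predictions) != len(references):
        raise ValueError('predictions and references must have equal length')
    if not references:
        return 0.0
    hits = 0
    for pred, ref in zip(predictions, references):
        if normalize:
            pred, ref = normalize_for_match(pred), normalize_for_match(ref)
        hits += pred == ref
    return hits / len(references) * 100

# Hypothetical model outputs and targets.
refs = ["SELECT id FROM ai_documents;", "SELECT file_name FROM ai_documents;"]
preds = ["SELECT id FROM ai_documents", "SELECT project_name FROM ai_documents;"]

print('strict:    ', exact_match_rate(preds, refs))
print('normalized:', exact_match_rate(preds, refs, normalize=True))
```

Case is deliberately left untouched here because the `source_table` values in this dataset are case-sensitive.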

In [None]:
# ============================================================
# CELL 8: Interactive test — try your own questions
# ============================================================

SCHEMA_PREFIX_Q = 'tables: ai_documents (id, source_table, file_name, project_name, searchable_text, metadata, document_type) | query: '

def generate_sql(question):
    full_input = SCHEMA_PREFIX_Q + question
    encoded = tokenizer(full_input, return_tensors='pt', max_length=MAX_INPUT_LEN, truncation=True, padding=True)
    input_ids = encoded.input_ids.to(model.device)
    attention_mask = encoded.attention_mask.to(model.device)
    outputs = model.generate(input_ids=input_ids, attention_mask=attention_mask, max_length=MAX_TARGET_LEN)
    return tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

test_questions = [
    'show all expense files',
    'what are the total fuel expenses',
    'how many labor entries are in project alpha',
    'total cash flow amount for highway 5',
    'show approved quotations for manila tower',
    'total volume delivered for plate ABC-1234',
    'list all active projects',
    'average expense amount for materials',
    'how many deliveries used 10-wheeler trucks',
    'compare fuel costs between manila tower and building c',
]

print('🧪 Testing SQL generation:\n')
for q in test_questions:
    sql = generate_sql(q)
    print(f'Q: {q}')
    print(f'→ {sql}')
    print()

In [None]:
# ============================================================
# CELL 8B: Try your own question (edit and re-run)
# ============================================================

my_question = 'total expenses for SJDM project'  # <-- edit this!

sql = generate_sql(my_question)
print(f'Q: {my_question}')
print(f'→ {sql}')

In [None]:
# ============================================================
# CELL 9: Export model (zip for download)
# ============================================================
import shutil
from pathlib import Path

FINAL_MODEL = '/kaggle/working/t5-finetuned/final'

print('📦 Zipping model for download...')
zip_path = shutil.make_archive('/kaggle/working/t5-finetuned-model', 'zip', FINAL_MODEL)
print(f'   Size: {Path(zip_path).stat().st_size / 1024 / 1024:.0f} MB')

print(f'\n✅ Model zipped at: {zip_path}')
print('\n📥 To download:')
print('   1. Click the "Output" tab in the right panel')
print('   2. Find t5-finetuned-model.zip')
print('   3. Click the download icon')
print()
print('   Or save this notebook as a Kaggle Dataset to reuse the model.')

In [None]:
# ============================================================
# CELL 9B: Push to HuggingFace Hub (optional)
# ============================================================
# Uncomment and fill in your details to push to HF Hub

# from huggingface_hub import login
# login(token='YOUR_HF_TOKEN')
#
# HF_REPO = 'your-username/t5-auggregates-text2sql'  # <-- change this!
#
# model.push_to_hub(HF_REPO)
# tokenizer.push_to_hub(HF_REPO)
# print(f'✅ Pushed to https://huggingface.co/{HF_REPO}')

---
## 📝 After Training

**To use the fine-tuned model in production:**

1. Download the zip from Cell 9 (Output tab) or use the HF Hub repo pushed at the end of Cell 6 (or manually via Cell 9B)
2. Extract to a folder on your server
3. Set `T5_MODEL_PATH` environment variable to the extracted folder path
4. The AI server will automatically use your fine-tuned model
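
Step 3 can be sketched as follows. How the AI server actually loads the model is not shown in this notebook, so the helper names and the fallback to the base checkpoint are assumptions. Only `T5_MODEL_PATH` and the `transformers` loading calls come from the steps above and from Cell 6.

```python
import os

# Base checkpoint named in Cell 6; used here as an assumed fallback.
DEFAULT_MODEL = 'gaussalgo/T5-LM-Large-text2sql-spider'

def resolve_model_path(env=None):
    """Return T5_MODEL_PATH if set and non-empty, else the base checkpoint."""
    env = os.environ if env is None else env
    return env.get('T5_MODEL_PATH') or DEFAULT_MODEL

def load_model(path=None):
    """Load tokenizer and model from a local folder or a Hub repo id."""
    # Imported lazily so resolve_model_path works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
    path = path or resolve_model_path()
    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModelForSeq2SeqLM.from_pretrained(path)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model

print('Would load model from:', resolve_model_path())
```

Because Cell 6 merges the LoRA adapter before saving, the extracted folder loads as a plain seq2seq model and `peft` is not needed at inference time.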

**Expected accuracy targets:**
- Exact-match: 70%+ (good), 80%+ (great)
- Valid SQL rate: 95%+

**If accuracy is low:**
- Check training data quality (run Cell 4 validation)
- Try more epochs (change EPOCHS in Cell 6)
- Try a lower learning rate (LEARNING_RATE in Cell 6 is already 1e-4, down from 3e-4)
- Add more diverse training pairs
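
When changing `EPOCHS`, it helps to know how many optimizer steps that implies. With Cell 6's settings (batch size 2, gradient accumulation 4, 200 warmup steps), a rough count can be sketched as below. The training-set size of 7,200 is a made-up example, `optimizer_steps` is a hypothetical helper, and the exact rounding inside the Trainer may differ slightly.

```python
import math

def optimizer_steps(n_train, batch_size, grad_accum, epochs):
    """Approximate (steps per epoch, total steps) for a single-GPU run."""
    batches_per_epoch = math.ceil(n_train / batch_size)
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch, steps_per_epoch * epochs

# Values from Cell 6, except N_TRAIN, which is a hypothetical example.
N_TRAIN, BATCH_SIZE, GRAD_ACCUM, EPOCHS, WARMUP = 7200, 2, 4, 10, 200

per_epoch, total = optimizer_steps(N_TRAIN, BATCH_SIZE, GRAD_ACCUM, EPOCHS)
print(f'Effective batch size: {BATCH_SIZE * GRAD_ACCUM}')
print(f'Steps per epoch:      {per_epoch}')
print(f'Total steps:          {total}')
print(f'Warmup share:         {WARMUP / total:.1%}')
```

If the dataset is small, the fixed 200 warmup steps can take up a large share of training, which is worth checking before adding or removing epochs.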

**Kaggle tips:**
- GPU sessions last up to 12 hours
- You get 30 hours/week of GPU time
- Save your notebook to keep the output files