<a href="https://colab.research.google.com/github/annanasnas/askqe/blob/main/Golden_Fine_Tuning_Judge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Create Training Data

### Load data

In [1]:
!git clone https://github.com/dayeonki/askqe

Cloning into 'askqe'...
remote: Enumerating objects: 1119, done.
remote: Counting objects: 100% (17/17), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 1119 (delta 14), reused 5 (delta 5), pack-reused 1102 (from 2)
Receiving objects: 100% (1119/1119), 52.45 MiB | 15.02 MiB/s, done.
Resolving deltas: 100% (886/886), done.
Updating files: 100% (1037/1037), done.


In [1]:
import sys

HF_DATASET = "zouharvi/bio-mqm-dataset"
ASKQE_PATH = "askqe/biomqm/dev_with_backtranslation.jsonl"

In [2]:
import unicodedata

def normalize_text(text):
    if not text:
        return ""
    text = unicodedata.normalize('NFC', text.strip()).replace('\u200b', '').replace('\ufeff', '')
    return ' '.join(text.split())

In [3]:
# ASKQE data
import json

askqe_data = []
exclusion_keys = set()

with open(ASKQE_PATH, 'r', encoding='utf-8') as f:
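    for line in f:
        try:
            item = json.loads(line)
        except json.JSONDecodeError:
            # skip malformed lines rather than swallowing every exception
            continue
        askqe_data.append(item)
        if item.get('src') and item.get('tgt'):
            # normalized, lowercased (src, tgt) pairs are later excluded from the training set
            key = (normalize_text(item['src']).lower(), normalize_text(item['tgt']).lower())
            exclusion_keys.add(key)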

print(f"Loaded {len(askqe_data)} samples from ASKQE repo. Exclusion keys: {len(exclusion_keys)}")

Loaded 5216 samples from ASKQE repo. Exclusion keys: 3144
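
The exclusion keys above are built from normalized, lowercased `(src, tgt)` pairs, so pairs that differ only in case, whitespace, Unicode composition, or zero-width characters collapse into one key. The sketch below illustrates that behaviour on made-up strings. `make_key` is a hypothetical helper name, and `normalize_text` is restated only so the snippet runs on its own.

```python
import unicodedata

def normalize_text(text):
    # Same steps as the notebook helper, restated so this sketch is self-contained.
    if not text:
        return ""
    text = unicodedata.normalize("NFC", text.strip())
    text = text.replace("\u200b", "").replace("\ufeff", "")
    return " ".join(text.split())

def make_key(src, tgt):
    # Hypothetical helper: the (src, tgt) key used for exclusion and deduplication.
    return (normalize_text(src).lower(), normalize_text(tgt).lower())

# Two surface variants of the same pair (illustrative strings, not from the dataset).
a = make_key("The patient is  stable.", "Cafe\u0301 au lait")
b = make_key("\ufeffthe patient is stable.\u200b", "caf\u00e9 au  lait ")

print(a == b)  # True: both variants map to the same key
```

Because the keys are this aggressive, near-identical pairs in the ASKQE file count only once, which is consistent with 5216 samples yielding 3144 keys.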


In [4]:
# HF data
from datasets import load_dataset

hf_data = []
for split in ["validation", "test"]:
    ds = load_dataset(HF_DATASET, split=split)
    hf_data.extend([dict(x) for x in ds])
print(f"Loaded {len(hf_data)} raw HF samples.")

Loaded 62173 raw HF samples.


In [5]:
from collections import Counter

# language distribution
langs = Counter(item.get('lang_src', 'unknown') for item in hf_data)
print("\nLanguage Distribution:")
for lang, cnt in langs.most_common():
    print(f"   {lang}: {cnt} ({100*cnt/len(hf_data):.1f}%)")


Language Distribution:
   en: 29642 (47.7%)
   de: 7124 (11.5%)
   zh: 5968 (9.6%)
   es: 5832 (9.4%)
   fr: 5265 (8.5%)
   ru: 4329 (7.0%)
   pt: 2388 (3.8%)
   it: 1625 (2.6%)


## Prepare & clean data

In [6]:
PROMPT = """You are an expert translation quality evaluator (STRICT MODE).

Task: Compare the semantic meaning of the Source Sentence and the Target Sentence (Translation).
You MUST be conservative: if you are not sure the meaning is identical, do NOT output "NONE".
When uncertain between two labels, choose the MORE SEVERE one.

Source Sentence: {source}
Target Sentence: {target}

How to judge (follow in this order):
1) Extract from the Source the key meaning units: entities, numbers/units, negation/polarity, modality (must/should/can), time/tense, and the main predicate + roles (who did what to whom).
2) Check each unit against the Target Translation.

Decision rules:
- CRITICAL if ANY of these occur:
  a) Expansion (Impact): any added claim/detail that introduces new meaning (not just obvious/implicit filler).
  b) Omission: any missing word/phrase that removes a meaning unit or changes what is asserted.
  c) Alteration: antonym, polarity/negation flip, different actor/object, different time, different condition, different outcome.
  d) Numbers/units/dates/names change or mismatch.
  e) Safety risk: the target could change an instruction, warning, permission, or prohibition.

- MAJOR if the core topic remains but an important detail/constraint is changed or blurred (scope, intensity, condition, timeframe, responsibility), without a full contradiction.

- MINOR ONLY if the difference is exclusively one (or more) of these CONTRATICO-minor perturbations and does NOT change truth conditions:
  • Spelling (1–2 words)
  • Word order
  • Synonym (same meaning, no register shift)
  • Intensifier (small emphasis change, no change to factual claim)
  • Expansion (No Impact): adds only contextually obvious/implicit info, no new proposition

- NONE only if semantically equivalent AND no meaning units are added/omitted/altered.

Output JSON only:
{{"classification":"NONE|MINOR|MAJOR|CRITICAL","reason":"brief explanation"}}"""

def get_severity(errors_list):
    # return the most severe label among the annotated errors
    if not errors_list:
        return "NONE"
    # a missing or empty severity is treated as Minor
    sevs = {(e.get("severity") or "Minor").upper() for e in errors_list}
    for s in ("CRITICAL", "MAJOR", "MINOR"):
        if s in sevs:
            return s
    return "NONE"

def make_reason(errors_list):
    if not errors_list:
        return "Semantically equivalent translation with no detected errors."
    parts = []
    for e in errors_list:
        cat = e.get("error_category", e.get("type", "unknown"))
        sub = e.get("error_subcategory", "")
        sev = e.get("severity", "Minor")
        term = e.get("term", e.get("text", ""))

        # format string like: Minor Accuracy/Translation: 'word'
        etype = f"{cat}/{sub}" if sub else cat
        parts.append(f"{sev} {etype}: '{term}'" if term else f"{sev} {etype}")
    return "; ".join(parts)

def create_example(src, tgt, errors):
    verdict = get_severity(errors)
    reason = make_reason(errors)

    user_msg = PROMPT.format(source=src, target=tgt)
    assistant_msg = json.dumps({"classification": verdict, "reason": reason}, ensure_ascii=False)

    return {
        "messages": [
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": assistant_msg}
        ],
        "text": user_msg + "\n\n" + assistant_msg,
        "classification": verdict
    }
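
As a worked illustration of how an annotation list becomes a training label, the sketch below picks the most severe entry from a made-up error list. `worst_severity` and `SEVERITY_RANK` are hypothetical names that mirror `get_severity`, and the two error records are invented for the example; only the field names come from the code above.

```python
# Rank used to pick the most severe label (Critical > Major > Minor).
SEVERITY_RANK = {"MINOR": 1, "MAJOR": 2, "CRITICAL": 3}

def worst_severity(errors):
    # Hypothetical stand-in for get_severity: empty list -> "NONE".
    if not errors:
        return "NONE"
    labels = [(e.get("severity") or "Minor").upper() for e in errors]
    known = [s for s in labels if s in SEVERITY_RANK]
    return max(known, key=SEVERITY_RANK.get) if known else "NONE"

# Invented annotations using the same field names the notebook reads.
errors = [
    {"severity": "Minor", "error_category": "Fluency",
     "error_subcategory": "Spelling", "term": "recieve"},
    {"severity": "major", "error_category": "Accuracy",
     "error_subcategory": "Mistranslation", "term": "dose"},
]

label = worst_severity(errors)
print(label)  # MAJOR
```

A single Major error outranks any number of Minor ones, so the example above is labeled `MAJOR` even though it also contains a spelling issue.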

In [7]:
from collections import Counter

MIN_CHAR_LEN = 10
MAX_CHAR_LEN = 2000
MIN_WORD_COUNT = 3
MAX_WORD_COUNT = 500
MAX_SPECIAL_CHAR_RATIO = 0.3

def validate_errors(errors):
    if errors is None:
        return True
    if not isinstance(errors, list):
        return False
    valid = {"Minor", "Major", "Critical", "minor", "major", "critical"}
    return all(isinstance(e, dict) and (not e.get("severity") or e.get("severity") in valid) for e in errors)

def is_bad_quality(text):
    if not text:
        return True
    # encoding check: U+FFFD replacement character or NUL byte
    if any(p in text for p in ['\ufffd', '\x00']):
        return True
    if sum(1 for c in text if unicodedata.category(c) == 'Cc') > len(text) * 0.05:
        return True

    # special char ratio check
    allowed_symbols = '.,;:!?-\'"()[]{}'
    normal = sum(1 for c in text if c.isalnum() or c.isspace() or c in allowed_symbols)
    ratio = 1.0 - normal / len(text)
    return ratio > MAX_SPECIAL_CHAR_RATIO

def filter_data(raw_items, keys_to_exclude, is_local=False):
    cleaned = []
    stats = Counter()
    seen_internal = set(keys_to_exclude) if keys_to_exclude else set()

    for item in raw_items:
        # 1. language check
        if not is_local and item.get('lang_src', '').lower() != "en":
            stats['non_english'] += 1
            continue

        src = normalize_text(item.get('src'))
        tgt = normalize_text(item.get('tgt'))

        # 2. basic integrity
        if not src or not tgt:
            stats['empty'] += 1
            continue
        if src.lower() == tgt.lower():
            stats['src_eq_tgt'] += 1
            continue

        # 3. length checks (words & chars)
        src_words, tgt_words = len(src.split()), len(tgt.split())
        if len(src) < MIN_CHAR_LEN or len(tgt) < MIN_CHAR_LEN or src_words < MIN_WORD_COUNT:
            stats['too_short'] += 1
            continue
        if len(src) > MAX_CHAR_LEN or len(tgt) > MAX_CHAR_LEN or src_words > MAX_WORD_COUNT:
            stats['too_long'] += 1
            continue

        # 4. quality checks
        if is_bad_quality(src) or is_bad_quality(tgt):
            stats['quality_issue'] += 1
            continue

        # 5. deduplication
        key = (src.lower(), tgt.lower())
        if key in seen_internal:
            stats['duplicate'] += 1
            continue
        seen_internal.add(key)

        # 6. error structure validation
        if not validate_errors(item.get('errors_tgt', [])):
            stats['invalid_errors'] += 1
            continue

        cleaned.append(create_example(src, tgt, item.get('errors_tgt', [])))

    print(f"   Input: {len(raw_items)} -> Output: {len(cleaned)}")
    print(f"   Filters: {dict(stats)}")
    return cleaned

print("\nProcessing Training Data (HF)...")
train_ex = filter_data(hf_data, keys_to_exclude=exclusion_keys, is_local=False)

print("\nProcessing Validation Data (ASKQE)...")
val_ex = filter_data(askqe_data, keys_to_exclude=None, is_local=True)



Processing Training Data (HF)...
   Input: 62173 -> Output: 13138
   Filters: {'duplicate': 15405, 'src_eq_tgt': 82, 'invalid_errors': 926, 'non_english': 32531, 'too_short': 89, 'quality_issue': 2}

Processing Validation Data (ASKQE)...
   Input: 5216 -> Output: 2884
   Filters: {'duplicate': 2056, 'invalid_errors': 256, 'src_eq_tgt': 18, 'too_short': 2}
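
A quick sanity check on the filter report: every input item is either kept or counted under exactly one filter, so the counters should add up to the number of dropped items. The numbers below are copied from the output above; `unaccounted` is a hypothetical helper name.

```python
# Counts copied from the cell output above.
hf_total, hf_kept = 62173, 13138
hf_filters = {"duplicate": 15405, "src_eq_tgt": 82, "invalid_errors": 926,
              "non_english": 32531, "too_short": 89, "quality_issue": 2}

askqe_total, askqe_kept = 5216, 2884
askqe_filters = {"duplicate": 2056, "invalid_errors": 256,
                 "src_eq_tgt": 18, "too_short": 2}

def unaccounted(total, kept, filters):
    # Items that were neither kept nor attributed to a filter (should be 0).
    return total - kept - sum(filters.values())

print(unaccounted(hf_total, hf_kept, hf_filters))          # 0
print(unaccounted(askqe_total, askqe_kept, askqe_filters))  # 0
```

Note that for the HF data the `duplicate` counter also includes pairs dropped because they appear in the ASKQE exclusion keys, since `seen_internal` is seeded with those keys.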


In [9]:
train_ex[:1]

[{'messages': [{'role': 'user',
   {'role': 'assistant',
    'content': '{"classification": "NONE", "reason": "Semantically equivalent translation with no detected errors."}'}],
  'classification': 'NONE'}]

## Balance data

In [8]:
import pandas as pd
import json

df = pd.DataFrame(train_ex)

def get_classification(messages):
    for msg in messages:
        if msg['role'] == 'assistant':
            try:
                content = json.loads(msg['content'])
                return content.get('classification')
            except (json.JSONDecodeError, AttributeError):
                return None
    return None

df['severity_class'] = df['messages'].apply(get_classification)
counts = df['severity_class'].value_counts()

print("Counts by category:")
print(counts)

print("\nIn percentages:")
print(df['severity_class'].value_counts(normalize=True) * 100)

Counts by category:
severity_class
NONE        5531
MINOR       4542
MAJOR       2725
CRITICAL     340
Name: count, dtype: int64

In percentages:
severity_class
NONE        42.099254
MINOR       34.571472
MAJOR       20.741361
CRITICAL     2.587913
Name: proportion, dtype: float64


In [9]:
import pandas as pd

df = pd.DataFrame(train_ex)

crit = df[df['classification'] == 'CRITICAL']
maj  = df[df['classification'] == 'MAJOR']
min_ = df[df['classification'] == 'MINOR']
none = df[df['classification'] == 'NONE']

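# oversample the rare CRITICAL class by repeating every example 3 times
crit_bal = pd.concat([crit] * 3)

# downsample the other classes to at most 1500 examples each
maj_bal  = maj.sample(n=min(len(maj), 1500), random_state=42)
min_bal  = min_.sample(n=min(len(min_), 1500), random_state=42)
none_bal = none.sample(n=min(len(none), 1500), random_state=42)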

df_balanced = pd.concat([crit_bal, maj_bal, min_bal, none_bal])
df_balanced = df_balanced.sample(frac=1, random_state=42).reset_index(drop=True)

train_ex = df_balanced.to_dict('records')

print(f"New size train_ex: {len(train_ex)}")
print(df_balanced['classification'].value_counts())

New size train_ex: 5520
classification
MAJOR       1500
NONE        1500
MINOR       1500
CRITICAL    1020
Name: count, dtype: int64
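
The balanced sizes follow directly from the class counts reported earlier: CRITICAL is repeated three times and every other class is capped at 1500. The sketch below reproduces that arithmetic; `balanced_sizes` is a hypothetical helper, not part of the notebook.

```python
# Class counts from the "Counts by category" output above.
counts = {"NONE": 5531, "MINOR": 4542, "MAJOR": 2725, "CRITICAL": 340}

CAP = 1500       # per-class cap used for downsampling
OVERSAMPLE = 3   # repetition factor applied to CRITICAL

def balanced_sizes(counts, cap, oversample, rare="CRITICAL"):
    # Rare class: repeat every example; other classes: keep at most `cap`.
    return {
        label: n * oversample if label == rare else min(n, cap)
        for label, n in counts.items()
    }

sizes = balanced_sizes(counts, CAP, OVERSAMPLE)
print(sizes["CRITICAL"], sum(sizes.values()))  # 1020 5520
```

Since the CRITICAL rows are exact copies, the training set contains each of those 340 examples three times; this raises their weight in the loss but adds no new information.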


## Saving

In [10]:
import random

RANDOM_SEED = 42

OUTPUT_TRAIN_PATH = "judge_training_data.jsonl"
OUTPUT_VAL_PATH = "judge_validation_data.jsonl"

random.seed(RANDOM_SEED)
random.shuffle(train_ex)

print("\nSaving...")
with open(OUTPUT_TRAIN_PATH, 'w', encoding='utf-8') as f:
    for ex in train_ex:
        f.write(json.dumps({"messages": ex["messages"], "text": ex["text"]}, ensure_ascii=False) + "\n")

with open(OUTPUT_VAL_PATH, 'w', encoding='utf-8') as f:
    for ex in val_ex:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")

print(f"Done. Train: {len(train_ex)}, Val: {len(val_ex)}")


Saving...
Done. Train: 5520, Val: 2884


# Fine-Tuning

In [13]:
!pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install -q datasets trl accelerate bitsandbytes


## Load data

In [11]:
PATHS = {
    "train": "judge_training_data.jsonl",
    "val": "judge_validation_data.jsonl",
    "output": "judge_unsloth_final",
    "drive": "/content/drive/MyDrive/LLM-Judge-FineTuned"
}


In [12]:
from datasets import Dataset
import json


def read_jsonl(path):
    with open(path, 'r', encoding='utf-8') as f:
        return [json.loads(line) for line in f]

train_raw = read_jsonl(PATHS["train"])
val_raw = read_jsonl(PATHS["val"])

# train_raw = train_raw[:100]  # uncomment to train on a small subset for a quick test
val_raw = val_raw[:1500]  # cap the validation set to speed up evaluation; remove to use all examples

print(f"Loaded {len(train_raw)} training and {len(val_raw)} validation examples")
train_ds, val_ds = Dataset.from_list(train_raw), Dataset.from_list(val_raw)

Loaded 5520 training and 1500 validation examples


In [17]:
train_ds[:1]

    'role': 'user'},
   {'content': '{"classification": "MINOR", "reason": "Minor Accuracy/Mistranslation: \'幅度\'"}',
    'role': 'assistant'}]],

## Model Init

In [13]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [14]:
from unsloth import FastLanguageModel, is_bfloat16_supported
from unsloth.chat_templates import train_on_responses_only
from trl import SFTTrainer
from transformers import TrainingArguments

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [15]:
MODEL_NAME = "unsloth/Qwen3-4B-Instruct-2507-bnb-4bit"
MAX_SEQ_LENGTH = 2048

In [16]:
print(f"\nLoading model: {MODEL_NAME}")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,
    load_in_4bit=True
)


Loading model: unsloth/Qwen3-4B-Instruct-2507-bnb-4bit
==((====))==  Unsloth 2026.2.1: Fast Qwen3 patching. Transformers: 4.57.6.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.494 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu128. CUDA: 8.0. CUDA Toolkit: 12.8. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [17]:
LORA_CONFIG = {
    "r": 64,
    "lora_alpha": 128,
    "lora_dropout": 0,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    "bias": "none",
    "use_gradient_checkpointing": "unsloth",
    "random_state": 42,
    "use_rslora": False,
    "loftq_config": None
}


print(f"Configuring LoRA...")
model = FastLanguageModel.get_peft_model(model, **LORA_CONFIG)
model.print_trainable_parameters()

Configuring LoRA...


Unsloth 2026.2.1 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.


trainable params: 132,120,576 || all params: 4,154,588,672 || trainable%: 3.1801
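
The reported trainable-parameter count can be reproduced by hand. A LoRA adapter of rank `r` on a linear layer with `in` inputs and `out` outputs adds `r * (in + out)` weights. The layer dimensions below are assumptions about the Qwen3-4B architecture that are not stated in this notebook; they are used here because they reproduce the printed number.

```python
# Assumed Qwen3-4B dimensions (not stated in the notebook).
HIDDEN = 2560      # hidden size
Q_OUT = 32 * 128   # query heads * head dim
KV_OUT = 8 * 128   # key/value heads * head dim
MLP = 9728         # MLP intermediate size
LAYERS = 36        # matches "patched 36 layers" in the output above
R = 64             # LoRA rank from LORA_CONFIG

# (in_features, out_features) of each targeted projection
shapes = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

# Each adapter adds A (r x in) and B (out x r), i.e. r * (in + out) weights.
per_layer = sum(R * (i + o) for i, o in shapes.values())
total = per_layer * LAYERS
print(f"{total:,}")  # 132,120,576
```

Under these assumptions the three MLP projections account for roughly two thirds of the adapter weights, so dropping them from `target_modules` would shrink the adapter considerably.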


## Train config

In [18]:
def apply_template(batch):
    # converts "messages" to raw text
    texts = [
        tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)
        for msgs in batch["messages"]
    ]
    return {"text": texts}

train_ds = train_ds.map(apply_template, batched=True, remove_columns=["messages"])
val_ds = val_ds.map(apply_template, batched=True, remove_columns=["messages"])

In [19]:
TRAIN_CONFIG = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 2,
    "per_device_eval_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "learning_rate": 2e-4,
    "warmup_ratio": 0.05,
    "weight_decay": 0.01,
    "fp16": not is_bfloat16_supported(),
    "bf16": is_bfloat16_supported(),
    "optim": "adamw_8bit",
    "max_grad_norm": 0.3,
    "seed": 42,
    "data_seed": 42,
    "group_by_length": True,
    "report_to": "none",
}

print(f"\nTraining configuration: Epochs={TRAIN_CONFIG['num_train_epochs']}, BS={TRAIN_CONFIG['per_device_train_batch_size']}")

training_args = TrainingArguments(
    output_dir=PATHS["output"],
    logging_dir=f"{PATHS['output']}/logs",
    logging_strategy="steps", logging_steps=10, logging_first_step=True,
    eval_strategy="steps", eval_steps=100,
    save_strategy="steps", save_steps=100, save_total_limit=2,
    load_best_model_at_end=True, metric_for_best_model="eval_loss", greater_is_better=False,
    lr_scheduler_type="cosine",
    **TRAIN_CONFIG
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    packing=False,
    args=training_args,
)

trainer = train_on_responses_only(
    trainer,
    instruction_part="<|im_start|>user\n",
    response_part="<|im_start|>assistant\n"
)


Training configuration: Epochs=3, BS=2


Unsloth: Removed 3 out of 5520 samples from train_dataset where all labels were -100 (no response found after truncation). This prevents NaN loss during training.
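
The message above is a consequence of response-only training: prompt tokens get the label `-100`, which the cross-entropy loss ignores, so a sample whose assistant response is cut off by truncation has nothing left to learn from. The toy sketch below illustrates the idea on plain integer lists. It is a conceptual illustration, not Unsloth's actual implementation, and `mask_prompt` is a hypothetical name.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_prompt(token_ids, response_start, max_len):
    # Truncate first, then hide every token that precedes the assistant response.
    ids = token_ids[:max_len]
    return [IGNORE_INDEX if i < response_start else t for i, t in enumerate(ids)]

tokens = list(range(10))   # stand-in token ids; the response starts at index 7
ok = mask_prompt(tokens, response_start=7, max_len=8)
cut = mask_prompt(tokens, response_start=7, max_len=6)

print(ok)                                    # [-100, -100, -100, -100, -100, -100, -100, 7]
print(all(x == IGNORE_INDEX for x in cut))   # True: nothing left to train on
```

Samples like `cut` would contribute no loss signal, which is why they are removed before training.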


## Training...

In [20]:
print(f"\n{'=' * 70}\nSTARTING TRAINING\n{'=' * 70}")
stats = trainer.train()
print(f"\nTraining complete: {stats.metrics['train_runtime']:.1f}s")


STARTING TRAINING


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,517 | Num Epochs = 3 | Total steps = 1,035
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 8 x 1) = 16
 "-____-"     Trainable parameters = 132,120,576 of 4,154,588,672 (3.18% trained)


Step   Training Loss   Validation Loss
 100   0.2964          0.383476
 200   0.2725          0.341863
 300   0.2781          0.345818
 400   0.2802          0.320099
 500   0.2626          0.301041
 600   0.2577          0.305003
 700   0.1869          0.304818
 800   0.1746          0.319338
 900   0.1618          0.320254
1000   0.1681          0.317809


Unsloth: Not an error, but Qwen3ForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient



Training complete: 3952.1s
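
The step count and the best checkpoint can be read off the numbers above. The sketch below recomputes the effective batch size and total optimizer steps, then picks the evaluation step with the lowest validation loss. The ceiling-based step formula is an approximation that happens to reproduce the reported 1,035 steps; all values are copied from the training log.

```python
import math

# Values from the training banner above.
num_examples = 5517
per_device_batch, grad_accum, num_gpus, epochs = 2, 8, 1, 3

effective_batch = per_device_batch * grad_accum * num_gpus
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs
print(effective_batch, steps_per_epoch, total_steps)  # 16 345 1035

# Validation loss per evaluation step, copied from the log table.
eval_loss = {
    100: 0.383476, 200: 0.341863, 300: 0.345818, 400: 0.320099, 500: 0.301041,
    600: 0.305003, 700: 0.304818, 800: 0.319338, 900: 0.320254, 1000: 0.317809,
}
best_step = min(eval_loss, key=eval_loss.get)
print(best_step, eval_loss[best_step])  # 500 0.301041
```

Training loss keeps falling after step 600 while validation loss does not improve on step 500, so with `load_best_model_at_end=True` and `metric_for_best_model="eval_loss"` the step-500 checkpoint is the one expected to be restored.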


## Save

In [21]:
import os
from google.colab import drive

# save models
print(f"\nSaving locally to: {PATHS['output']}")
# save adapter
model.save_pretrained(PATHS['output'])
tokenizer.save_pretrained(PATHS['output'])

# save merged
merged_path = f"{PATHS['output']}_merged_16bit"
print("Saving merged 16-bit model...")
model.save_pretrained_merged(merged_path, tokenizer, save_method="merged_16bit")

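# export to Drive (create the target folder first so both copies land inside it)
drive.mount('/content/drive')
os.makedirs(PATHS['drive'], exist_ok=True)
os.system(f"cp -rf {merged_path} {PATHS['drive']}")
os.system(f"cp -rf {PATHS['output']} {PATHS['drive']}")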



Saving locally to: judge_unsloth_final
Saving merged 16-bit model...


Found HuggingFace hub cache directory: /root/.cache/huggingface/hub


Checking cache directory for required files...
Cache check failed: model-00001-of-00002.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Note: tokenizer.model not found (this is OK for non-SentencePiece models)



Unsloth: Merge process complete. Saved to `/content/judge_unsloth_final_merged_16bit`
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


0

Verify the saved model by reloading it from Drive and running one test prompt.

In [22]:
from unsloth import FastLanguageModel
from google.colab import drive
import torch

drive.mount('/content/drive')
DRIVE_MODEL_PATH = "/content/drive/MyDrive/LLM-Judge-FineTuned/judge_unsloth_final_merged_16bit"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = DRIVE_MODEL_PATH,
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)

src_text = "The patient shows signs of severe hypotension."
tgt_text = "У пациента наблюдаются признаки тяжелой гипертонии."

input_prompt = PROMPT.format(source=src_text, target=tgt_text)

messages = [{"role": "user", "content": input_prompt}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_tensors="pt").to("cuda")

outputs = model.generate(input_ids=inputs, max_new_tokens=256, use_cache=True, temperature=0.1)

response_text = tokenizer.batch_decode(outputs)[0]

print("-" * 50)
print("RAW OUTPUT:")
print(response_text.split("assistant\n")[-1])
print("-" * 50)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.



Generating response...
--------------------------------------------------
RAW OUTPUT:
<think>

</think>

{"classification": "MAJOR", "reason": "Major Terminology/Wrong_term: 'гипертонии'"}<|im_end|>
--------------------------------------------------
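
The raw output above wraps the JSON verdict in an empty `<think>` block and ends with the `<|im_end|>` token, so it needs light cleanup before it can be used programmatically. The sketch below shows one way to do that; `parse_judge_output` is a hypothetical helper, and the input string is the raw output printed above.

```python
import json
import re

def parse_judge_output(raw):
    # Drop any <think>...</think> block and chat special tokens such as <|im_end|>.
    text = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    text = re.sub(r"<\|[^|]*\|>", "", text).strip()
    # Keep the outermost {...} span in case extra text surrounds the JSON.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None

raw = ('<think>\n\n</think>\n\n'
       '{"classification": "MAJOR", "reason": "Major Terminology/Wrong_term: '
       '\'гипертонии\'"}<|im_end|>')

result = parse_judge_output(raw)
print(result["classification"])  # MAJOR
```

Returning `None` for unparseable output lets a caller count and inspect failures separately instead of crashing in the middle of an evaluation run.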
