# Polish Gender-Inclusive Proofreading with Qwen3

This notebook fine-tunes a **Qwen3 model** (4B or 8B, selected with `MODEL_SIZE`) for the IPIS Polish gender-inclusive proofreading task (Subtask A).

**Task Overview:**
- Transform standard Polish text into gender-inclusive language
- Handle various inclusive forms: asterisks (*), slashes (/), and conjunctions
- Follow Polish grammar agreement rules

**Dataset:** IPIS-proofreading from PolEval 2025

Based on the Unsloth template. Licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).

### Installation

In [None]:
from tqdm.auto import tqdm
import warnings
warnings.filterwarnings('ignore', category=UserWarning, module='tqdm')

In [None]:
# %%capture
# import os, re
# if "COLAB_" not in "".join(os.environ.keys()):
#     !pip install unsloth
# else:
#     # Do this only in Colab notebooks! Otherwise use pip install unsloth
#     import torch; v = re.match(r"[0-9\.]{3,}", str(torch.__version__)).group(0)
#     xformers = "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")
#     !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
#     !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
#     !pip install --no-deps unsloth
# !pip install transformers==4.56.2
# !pip install --no-deps trl==0.22.2

### Load Model - Qwen3 (4B or 8B)

In [None]:
MODEL_SIZE = "4B"  # Choose between "4B" and "8B"
LORA_RANK = 32  # Choose between 16 and 32; also passed as lora_alpha, so alpha / r = 1
EPOCHS = 1.5
BATCH_SIZE = 1  # Adjust based on your GPU memory
GRADIENT_ACCUMULATION_STEPS = 2  # To simulate larger batch size
LEARNING_RATE = 2e-4
WARMUP_STEPS = 10
MAX_SEQ_LENGTH = 4096  
SEED = 3407
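
With `BATCH_SIZE = 1` and `GRADIENT_ACCUMULATION_STEPS = 2`, each optimizer update sees two examples. The sketch below estimates how many optimizer steps a run takes, which helps when choosing `eval_steps` and `save_steps`. The dataset size of 2000 is a hypothetical placeholder (the real value is `len(train_dataset)`), and the rounding used by the trainer may differ by a step.

```python
import math

def optimizer_steps(num_examples, batch_size, grad_accum, epochs):
    """Estimate the effective batch size and total optimizer updates (single GPU)."""
    effective_batch = batch_size * grad_accum
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = math.ceil(steps_per_epoch * epochs)
    return effective_batch, total_steps

# Hypothetical dataset size, for illustration only.
effective_batch, total_steps = optimizer_steps(
    num_examples=2000, batch_size=1, grad_accum=2, epochs=1.5
)
print(f"effective batch = {effective_batch}, optimizer steps = {total_steps}")
```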

In [None]:
# Work around Hugging Face and Triton cache permission issues by using a local cache directory.
# Run this cell before importing unsloth or transformers so the settings take effect.
import os

CACHE_ROOT = '/mnt/d/Pobrane/poleval-gender/.cache'
HF_CACHE = os.path.join(CACHE_ROOT, 'huggingface')
TRITON_CACHE = os.path.join(CACHE_ROOT, 'triton')

os.environ['HF_HOME'] = HF_CACHE
os.environ['TRANSFORMERS_CACHE'] = os.path.join(HF_CACHE, 'transformers')
os.environ['HF_DATASETS_CACHE'] = os.path.join(HF_CACHE, 'datasets')
os.environ['TRITON_CACHE_DIR'] = TRITON_CACHE

# Create the directories if they don't exist
os.makedirs(HF_CACHE, exist_ok=True)
os.makedirs(TRITON_CACHE, exist_ok=True)

In [None]:
import mlflow

# Configure MLflow experiment
mlflow.set_tracking_uri("file:./mlruns")
mlflow.set_experiment(f"qwen3-{MODEL_SIZE}-polish-inclusive-proofreading")




In [None]:
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = (
        "unsloth/Qwen3-4B-Instruct-2507-unsloth-bnb-4bit" if MODEL_SIZE == "4B"
        else "unsloth/Qwen3-8B-unsloth-bnb-4bit"
    ),
    max_seq_length = MAX_SEQ_LENGTH,  # Increased for longer Polish texts
    load_in_4bit = True,
    load_in_8bit = False,
    full_finetuning = False,
    # token = "hf_...", # use one if using gated models
)

We now add LoRA adapters, so only a small fraction of the model's parameters needs to be updated.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = LORA_RANK, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = LORA_RANK,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = SEED,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
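
To give a feel for why LoRA trains so few parameters: for a linear layer with `d_in` inputs and `d_out` outputs, a rank-`r` adapter adds two small matrices with `r * (d_in + d_out)` weights in total. The sketch below computes that count and the scaling factor applied to the adapter update. The layer shape is a hypothetical example, not a value read from the Qwen3 configuration.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable weights added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

def lora_scaling(lora_alpha, rank, use_rslora=False):
    """Standard LoRA scales the update by alpha / r; rank-stabilized LoRA uses alpha / sqrt(r)."""
    return lora_alpha / (rank ** 0.5 if use_rslora else rank)

# Hypothetical square projection layer, for illustration only.
d_in, d_out, rank = 4096, 4096, 32
added = lora_param_count(d_in, d_out, rank)
frozen = d_in * d_out
print(f"adapter weights: {added} ({added / frozen:.2%} of the frozen layer)")
print(f"scaling with lora_alpha == r: {lora_scaling(rank, rank)}")
```

Because the notebook sets `lora_alpha = LORA_RANK` with `use_rslora = False`, the scaling factor stays at 1 whichever rank is chosen.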

<a name="Data"></a>
### Data Prep - IPIS Polish Gender-Inclusive Proofreading Dataset

We load the IPIS-proofreading dataset for Polish gender-inclusive language transformation. Each record contains:
- **prompt**: Task instruction (various Polish phrasings)
- **source**: Input text in standard Polish
- **target**: Expected gender-inclusive output
- **messages**: Pre-formatted conversation with user/assistant roles

In [None]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "qwen3-instruct",
)

In [None]:
import json
from datasets import Dataset


# Load IPIS dataset from local JSONL file
def load_jsonl(file_path):
    data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            data.append(json.loads(line))
    return data

# Load train set for TRAINING
train_data = load_jsonl('data/taskA/train.jsonl')
train_dataset = Dataset.from_list(train_data)

# Load dev set for VALIDATION
dev_data = load_jsonl('data/taskA/dev.jsonl')
dev_dataset = Dataset.from_list(dev_data)

print(f"Loaded {len(train_dataset)} training examples")
print(f"Loaded {len(dev_dataset)} validation examples")
print(f"Columns: {train_dataset.column_names}")
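
Before training, it can be worth checking that every record carries the four fields described above. The following is a minimal sketch of such a check. The two sample records are made up for illustration and are not taken from the IPIS data, and the helper names are hypothetical. Unlike `load_jsonl`, this reader also skips blank lines.

```python
import io
import json

REQUIRED_KEYS = {"prompt", "source", "target", "messages"}

def read_jsonl(handle):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]

def find_incomplete(records):
    """Return (index, missing keys) for each record that lacks a required field."""
    problems = []
    for i, record in enumerate(records):
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append((i, missing))
    return problems

# Made-up sample: the second record has no "target" field.
sample = io.StringIO(
    '{"prompt": "p", "source": "s", "target": "t", "messages": []}\n'
    "\n"
    '{"prompt": "p", "source": "s", "messages": []}\n'
)
records = read_jsonl(sample)
print(find_incomplete(records))
```

The same check could be run on the real data with `find_incomplete(train_data)`.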

Let's examine a sample from the dataset:

In [None]:
# Show example with gender-inclusive transformations
example = train_dataset[6]  # This one has actual transformations
print("=" * 60)
print("PROMPT:", example['prompt'])
print("=" * 60)
print("SOURCE:", example['source'])
print("=" * 60)
print("TARGET:", example['target'])
print("=" * 60)

### Add System Prompt and Apply Chat Template

We'll add the Polish gender-inclusive editing system prompt to each conversation and apply the Qwen3 chat template.

In [None]:
# Load the Polish system prompt
with open('system_prompts/proofreading/system_prompt_pl_proofreading', 'r', encoding='utf-8') as f:
    SYSTEM_PROMPT = f.read().strip()

def add_system_prompt_and_format(examples):
    """Add system prompt to messages and apply chat template"""
    texts = []
    
    for messages in examples['messages']:
        # Add system prompt at the beginning
        full_messages = [
            {"role": "system", "content": SYSTEM_PROMPT}
        ] + messages
        
        # Apply chat template
        text = tokenizer.apply_chat_template(
            full_messages, 
            tokenize=False, 
            add_generation_prompt=False
        )
        texts.append(text)
    
    return {"text": texts}

train_dataset = train_dataset.map(add_system_prompt_and_format, batched=True)
dev_dataset = dev_dataset.map(add_system_prompt_and_format, batched=True)
print("Datasets formatted with system prompt and chat template")
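
For orientation, the chat template wraps each message in `<|im_start|>` and `<|im_end|>` markers. The function below is a simplified stand-in that mimics this layout so the structure of the training text is easier to picture. It is not the tokenizer's actual template, which may differ in details, and `render_chatml` is a hypothetical name.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Simplified ChatML-style rendering of a list of {"role", "content"} messages."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # At inference time the prompt ends with an open assistant turn.
        text += "<|im_start|>assistant\n"
    return text

conversation = [
    {"role": "system", "content": "SYSTEM PROMPT"},
    {"role": "user", "content": "instruction and source text"},
    {"role": "assistant", "content": "inclusive target text"},
]
print(render_chatml(conversation))
```

Training text includes the assistant turn and uses `add_generation_prompt=False`; at inference the assistant turn is left out and `add_generation_prompt=True` opens it for the model to complete.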

Let's verify the formatted text with system prompt:

In [None]:
print(train_dataset[6]['text'][:1000] + "...")  # Show first 1000 chars

<a name="Train"></a>
### Train the model

Training parameters chosen for the Polish gender-inclusive proofreading task. With `EPOCHS = 1.5`, training runs for one and a half passes over the dataset.

In [None]:
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    eval_dataset = dev_dataset,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = BATCH_SIZE,
        per_device_eval_batch_size = 1,  # Smaller batch size for evaluation to avoid OOM
        gradient_accumulation_steps = GRADIENT_ACCUMULATION_STEPS,
        warmup_steps = WARMUP_STEPS,
        num_train_epochs = EPOCHS,
        max_steps = -1,  # Let it run for full epochs
        learning_rate = LEARNING_RATE,
        logging_steps = 10,
        eval_strategy = "steps",  # Evaluate at fixed step intervals
        eval_steps = 500,  # Evaluate every 500 steps (approximately 0.5 epoch with current settings)
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "cosine",
        seed = SEED,
        output_dir = "outputs",
        save_strategy = "steps",
        save_steps = 500,  # Save at same intervals as evaluation
        report_to = "mlflow",
        load_best_model_at_end = True,
        metric_for_best_model = "loss",
        run_name = f"lora_r{LORA_RANK}_lr{LEARNING_RATE}_ep{EPOCHS}_bs{BATCH_SIZE}_ga{GRADIENT_ACCUMULATION_STEPS}_warmup{WARMUP_STEPS}_seq{MAX_SEQ_LENGTH}",
    ),
)
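
The configuration above combines `warmup_steps` with a cosine learning-rate schedule. As a rough sketch of what that means, the function below ramps the rate linearly during warmup and then decays it along half a cosine wave to zero. It mirrors the usual formula for this schedule but is not the trainer's own code, and the total of 1500 steps is a hypothetical value.

```python
import math

def cosine_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

# Hypothetical run length; base_lr and warmup_steps match the notebook settings.
for step in (0, 5, 10, 755, 1500):
    print(step, f"{cosine_lr(step, 2e-4, 10, 1500):.2e}")
```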


We use Unsloth's `train_on_responses_only` to train only on the assistant's gender-inclusive outputs, not the user's input text.

In [None]:
from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|im_start|>user\n",
    response_part = "<|im_start|>assistant\n",
)
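
Conceptually, response-only training sets the label of every token outside an assistant turn to `-100`, the index that the loss function ignores. The sketch below shows the idea on plain lists of token IDs. It is a simplified illustration under that assumption, not Unsloth's implementation, and the toy IDs stand in for the tokenized marker strings.

```python
IGNORE_INDEX = -100

def find_sub(seq, sub, start=0):
    """Index of the first occurrence of sub in seq at or after start, or -1."""
    for i in range(start, len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1

def mask_non_response(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens that follow a response marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        start = find_sub(input_ids, response_ids, pos)
        if start == -1:
            break
        begin = start + len(response_ids)
        end = find_sub(input_ids, instruction_ids, begin)
        if end == -1:
            end = len(input_ids)
        labels[begin:end] = input_ids[begin:end]
        pos = end
    return labels

# Toy IDs: [1, 2] plays the user marker, [1, 3] the assistant marker, 9 the end token.
ids = [1, 2, 10, 11, 9, 1, 3, 20, 21, 9]
print(mask_non_response(ids, instruction_ids=[1, 2], response_ids=[1, 3]))
```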

Verify that masking is applied correctly. First, decode the full input for one example:

In [None]:
# Show full input
print("FULL INPUT:")
print(tokenizer.decode(trainer.train_dataset[6]["input_ids"])[:500] + "...")

Now let's see the masked labels (only assistant response should be visible):

In [None]:
print("MASKED OUTPUT (training target):")
masked = tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[6]["labels"]]).replace(tokenizer.pad_token, " ")
print(masked)

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

Let's train the model! To resume a training run, call `trainer.train(resume_from_checkpoint = True)` instead.

In [None]:
trainer_stats = trainer.train()

In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

<a name="Inference"></a>
### Inference - Test Gender-Inclusive Proofreading

Let's test the model on a Polish text that requires gender-inclusive transformation.

In [None]:
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

# Test with a Polish text requiring gender-inclusive transformation
test_text = """Każdy pracownik ma prawo do urlopu. Nauczyciel powinien przygotować się do lekcji. Studenci uczestniczą w wykładach."""

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": f"Przekształć tekst polski na jego wersję inkluzywną: {test_text}"}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True,
)

from transformers import TextStreamer
print("=" * 60)
print("INPUT:", test_text)
print("=" * 60)
print("OUTPUT:")
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 512,
    temperature = 0.3,  # Lower temperature for more precise transformations
    top_p = 0.9, 
    top_k = 50,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)
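
The `temperature`, `top_k`, and `top_p` arguments shape the distribution the next token is sampled from. The sketch below applies the three filters to a small list of logits to show their effect. It is a simplified illustration rather than the library's implementation, and the logits are made up.

```python
import math

def filtered_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    probs = [e / sum(exps) for e in exps]

    # Rank tokens by probability and keep the k most likely (0 means no limit).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    mass = sum(probs[i] for i in order)
    kept = {i: probs[i] / mass for i in order}

    # Keep the smallest prefix whose cumulative probability reaches top_p.
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += kept[i]
        if cumulative >= top_p:
            break
    mass = sum(kept[i] for i in nucleus)
    return {i: kept[i] / mass for i in nucleus}

logits = [2.0, 1.0, 0.5, -1.0]  # made-up values
print(filtered_probs(logits, temperature=1.0))
print(filtered_probs(logits, temperature=0.3, top_k=50, top_p=0.9))
```

A lower temperature concentrates probability on the most likely tokens, which suits a proofreading task where most of the input should be copied unchanged.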

<a name="Save"></a>
### Saving the Fine-tuned Model

Save the model for the Polish gender-inclusive proofreading task.

In [None]:
model.save_pretrained(f"Qwen3_{MODEL_SIZE}_polish_inclusive_proofreading_lora_r{LORA_RANK}_ep{EPOCHS}")
tokenizer.save_pretrained(f"Qwen3_{MODEL_SIZE}_polish_inclusive_proofreading_lora_r{LORA_RANK}_ep{EPOCHS}")
print(f"Model saved to Qwen3_{MODEL_SIZE}_polish_inclusive_proofreading_lora_r{LORA_RANK}_ep{EPOCHS}/")

# Optional: push to Hugging Face Hub
# model.push_to_hub("your_username/qwen3-8b-polish-inclusive", token = "...")
# tokenizer.push_to_hub("your_username/qwen3-8b-polish-inclusive", token = "...")

### Load the saved model for inference:

In [None]:
if True:  # Set to False to skip reloading the saved adapter
    from unsloth import FastLanguageModel
    model2, tokenizer2 = FastLanguageModel.from_pretrained(
        model_name = f"Qwen3_{MODEL_SIZE}_polish_inclusive_proofreading_lora_r{LORA_RANK}_ep{EPOCHS}",
        max_seq_length = MAX_SEQ_LENGTH,
        load_in_4bit = True,
    )
    FastLanguageModel.for_inference(model2)

In [None]:
# Run a sample inference with the reloaded model
test_text = """Każdy pracownik ma prawo do urlopu. Nauczyciel powinien przygotować się do lekcji. Studenci uczestniczą w wykładach."""
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": f"Przekształć tekst polski na jego wersję inkluzywną: {test_text}"}
]
text = tokenizer2.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True,
)
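inputs = tokenizer2(text, return_tensors="pt").to("cuda")
outputs = model2.generate(
    **inputs,
    max_new_tokens = 512,
    temperature = 0.3,
    top_p = 0.9,
)
# Decode only the newly generated tokens (everything after the prompt)
generated = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer2.decode(generated, skip_special_tokens = True))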

In [None]:
print("=" * 60)
print("INPUT:", test_text)
print("=" * 60)
print("OUTPUT:")
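_ = model2.generate(
    **tokenizer2(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 512,  # Same generation settings as the earlier inference cells
    temperature = 0.3,
    top_p = 0.9,
    streamer = TextStreamer(tokenizer2, skip_prompt = True),
)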