#  **Study 04: Experiments with reasoning LLMs**

**Authors:** Adonias Caetano de Oliveira, Ariel Soares Teles

**Institution:** UFDPar - PPGBiotec

**LLM:** Qwen3 without Reasoning


## **Installation**

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !pip install --no-deps unsloth

In [None]:
!pip install Unidecode

Collecting Unidecode
  Downloading Unidecode-1.4.0-py3-none-any.whl.metadata (13 kB)
Downloading Unidecode-1.4.0-py3-none-any.whl (235 kB)
Installing collected packages: Unidecode
Successfully installed Unidecode-1.4.0


## **Dataset**

In [None]:
import pandas as pd
import re
import random
from unidecode import unidecode
from string import punctuation
import numpy as np

In [None]:
url = 'https://docs.google.com/spreadsheets/d/1RvhOd18FkFEMG8xqhO6JXNWawaKe1lsZ/edit?usp=sharing&ouid=107874559379957908991&rtpof=true&sd=true'
file_id = url.split('/')[-2]
read_url='https://drive.google.com/uc?id=' + file_id

# read the data
dataset_boamente = pd.read_excel(read_url)

dataset_boamente['target'] = dataset_boamente['target'].astype(int)

# display the dataset (the first rows are shown below)
dataset_boamente

Unnamed: 0,text,target,label,answer,reasoning
0,Legitimamente engatilhado ao ver isso porque s...,0,não contém ideação suicida,,
1,tô tão triste queria só dormir pra sempre,1,contém ideação suicida,,
2,passei da fase de planejar meu suicídio todo d...,1,contém ideação suicida,,
3,O ciúmes ainda vai me matar um dia,0,não contém ideação suicida,,
4,sei lá só queria dormir pra sempre,1,contém ideação suicida,,
5,sinto que o Arthur tem vontade de me matar cad...,0,não contém ideação suicida,,
6,o marcelo eh mt chato vou me matar,0,não contém ideação suicida,,
7,vou me matar comendo doritos flaming hot tchau,0,não contém ideação suicida,,
8,vontade de dormir pra sempre,1,contém ideação suicida,,
9,só quero dormir pra sempre a,1,contém ideação suicida,,
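
The loading cell above takes the file id from the sharing URL by position (`url.split('/')[-2]`), which returns the wrong piece if the link has a different shape. A minimal sketch of a more defensive extraction, assuming the id always follows a `/d/` path segment; `extract_file_id` is a hypothetical helper, not part of the notebook:

```python
from urllib.parse import urlparse

def extract_file_id(share_url: str) -> str:
    """Return the id that follows the '/d/' segment of a Google Sheets/Drive sharing URL."""
    parts = urlparse(share_url).path.split("/")
    # Assumption: sharing links look like .../d/<file_id>/edit
    if "d" not in parts:
        raise ValueError(f"no '/d/<id>' segment in URL: {share_url}")
    position = parts.index("d") + 1
    if position >= len(parts) or not parts[position]:
        raise ValueError(f"empty file id in URL: {share_url}")
    return parts[position]

url = "https://docs.google.com/spreadsheets/d/1RvhOd18FkFEMG8xqhO6JXNWawaKe1lsZ/edit?usp=sharing"
read_url = "https://drive.google.com/uc?id=" + extract_file_id(url)
print(read_url)
```

The query string is ignored by `urlparse(...).path`, so extra sharing parameters do not affect the result.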


In [None]:
def get_sentenca():
  random_index = dataset_boamente.sample(n=1).index[0]
  random_index = int(random_index)
  return dataset_boamente.loc[random_index, 'text'], dataset_boamente.loc[random_index, 'target']

In [None]:
def get_examples_by_target(quant):
  negativos = dataset_boamente.loc[dataset_boamente['target'] == 0].sample(n = int(quant/2))
  positivos = dataset_boamente.loc[dataset_boamente['target'] == 1].sample(n = int(quant/2))

  sentencas = list(negativos['text'].values) + list(positivos['text'].values)
  targets = list(negativos['target'].values) + list(positivos['target'].values)
  labels = list(negativos['label'].values) + list(positivos['label'].values)

  return sentencas, targets, labels
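
`get_examples_by_target` draws `int(quant/2)` rows from each class, so an odd `quant` yields one row fewer than requested, and pandas raises an error if a class has fewer rows than the requested sample. The same logic can be sketched without pandas; `balanced_sample` and the toy rows are hypothetical and only illustrate the idea:

```python
import random

def balanced_sample(rows, quant, seed=None):
    """Pick quant // 2 rows per class from (text, target) pairs, negatives first."""
    rng = random.Random(seed)
    per_class = quant // 2
    negativos = [row for row in rows if row[1] == 0]
    positivos = [row for row in rows if row[1] == 1]
    if min(len(negativos), len(positivos)) < per_class:
        raise ValueError("not enough rows in one of the classes")
    return rng.sample(negativos, per_class) + rng.sample(positivos, per_class)

# Toy data, made up for illustration.
toy_rows = [("a", 0), ("b", 0), ("c", 0), ("d", 1), ("e", 1), ("f", 1)]
picked = balanced_sample(toy_rows, quant=5, seed=0)
print(len(picked), [target for _, target in picked])
```

Passing a seed makes the draw repeatable, which the notebook's version does not do.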

## **Unsloth**

In [None]:
from unsloth import FastLanguageModel
import torch

fourbit_models = [
    "unsloth/Qwen3-1.7B-unsloth-bnb-4bit", # Qwen3 1.7B, 2x faster
    "unsloth/Qwen3-4B-unsloth-bnb-4bit",
    "unsloth/Qwen3-8B-unsloth-bnb-4bit",
    "unsloth/Qwen3-14B-unsloth-bnb-4bit",
    "unsloth/Qwen3-32B-unsloth-bnb-4bit",
    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/Phi-4",
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/orpheus-3b-0.1-ft-unsloth-bnb-4bit" # [NEW] We support TTS models!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-8B-unsloth-bnb-4bit",
    max_seq_length = 2048,   # Context length - can be longer, but uses more memory
    load_in_4bit = True,     # 4bit uses much less memory
    load_in_8bit = False,    # A bit more accurate, uses 2x memory
    full_finetuning = False, # We have full finetuning now!
    # token = "hf_...",      # use one if using gated models
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.6.3: Fast Qwen3 patching. Transformers: 4.52.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,           # Choose any number > 0! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,  # Best to choose alpha = rank or rank*2
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,   # We support rank stabilized LoRA
    loftq_config = None,  # And LoftQ
)

Unsloth 2025.6.3 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
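
The trainable-parameter count that the training banner reports further down (87,293,952) can be reproduced by hand: a LoRA adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights. The sketch below assumes the Qwen3-8B dimensions (hidden size 4096, 36 layers, 32 query heads and 8 key/value heads of dimension 128, MLP size 12288); these values are not printed in this notebook and are taken as assumptions:

```python
# Assumed Qwen3-8B dimensions (not printed anywhere in this notebook).
HIDDEN, LAYERS, HEAD_DIM = 4096, 36, 128
Q_HEADS, KV_HEADS, MLP = 32, 8, 12288
R = 32  # LoRA rank passed to get_peft_model above

# (d_in, d_out) of every adapted projection in one transformer layer.
shapes = {
    "q_proj": (HIDDEN, Q_HEADS * HEAD_DIM),
    "k_proj": (HIDDEN, KV_HEADS * HEAD_DIM),
    "v_proj": (HIDDEN, KV_HEADS * HEAD_DIM),
    "o_proj": (Q_HEADS * HEAD_DIM, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(d_in, d_out, r):
    """Weights in the two low-rank factors: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in shapes.values())
total = per_layer * LAYERS
print(f"{total:,} trainable LoRA parameters")
```

Doubling `R` doubles this count, which is why the rank is the main lever on adapter size.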


<a name="Data"></a>
## **Data Prep**
Qwen3 has both a reasoning and a non-reasoning mode, so we use two datasets:

1. We use the Open Math Reasoning dataset (`unsloth/OpenMathReasoning-mini`), which was used to win the [AIMO](https://www.kaggle.com/competitions/ai-mathematical-olympiad-progress-prize-2/leaderboard) (AI Mathematical Olympiad - Progress Prize 2) challenge! We sample 10% of the verifiable reasoning traces that were generated with DeepSeek R1 and that reached more than 95% accuracy.

2. We also leverage [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style. We need to convert it to Hugging Face's standard multi-turn conversation format as well.

In [None]:
from datasets import load_dataset
reasoning_dataset = load_dataset("unsloth/OpenMathReasoning-mini", split = "cot")
non_reasoning_dataset = load_dataset("mlabonne/FineTome-100k", split = "train")


Let's see the structure of both datasets:

In [None]:
reasoning_dataset

Dataset({
    features: ['expected_answer', 'problem_type', 'problem_source', 'generation_model', 'pass_rate_72b_tir', 'problem', 'generated_solution', 'inference_mode'],
    num_rows: 19252
})

In [None]:
non_reasoning_dataset

Dataset({
    features: ['conversations', 'source', 'score'],
    num_rows: 100000
})

We now convert the reasoning dataset into conversational format:

In [None]:
def generate_conversation(examples):
    problems  = examples["problem"]
    solutions = examples["generated_solution"]
    conversations = []
    for problem, solution in zip(problems, solutions):
        conversations.append([
            {"role" : "user",      "content" : problem},
            {"role" : "assistant", "content" : solution},
        ])
    return { "conversations": conversations, }
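
Because `map(..., batched = True)` passes a dictionary of columns, `generate_conversation` receives lists and must return lists of the same length. A quick check on a hand-made batch (the two toy problems are made up for illustration, and the function body is repeated so that the snippet runs on its own):

```python
def generate_conversation(examples):
    problems  = examples["problem"]
    solutions = examples["generated_solution"]
    conversations = []
    for problem, solution in zip(problems, solutions):
        conversations.append([
            {"role" : "user",      "content" : problem},
            {"role" : "assistant", "content" : solution},
        ])
    return { "conversations": conversations, }

# A batch shaped the way datasets.map(batched=True) would pass it: one list per column.
toy_batch = {
    "problem": ["1 + 1 = ?", "2 * 3 = ?"],
    "generated_solution": ["2", "6"],
}
out = generate_conversation(toy_batch)
print(out["conversations"][0])
```

Each conversation is a two-turn list, which is the shape `tokenizer.apply_chat_template` consumes in the next cell.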

In [None]:
reasoning_conversations = tokenizer.apply_chat_template(
    reasoning_dataset.map(generate_conversation, batched = True)["conversations"],
    tokenize = False,
)


Let's see the first transformed row:

In [None]:
reasoning_conversations[0]

"<|im_start|>user\nGiven $\\sqrt{x^2+165}-\\sqrt{x^2-52}=7$ and $x$ is positive, find all possible values of $x$.<|im_end|>\n<|im_start|>assistant\n<think>\nOkay, let's see. I need to solve the equation √(x² + 165) - √(x² - 52) = 7, and find all positive values of x. Hmm, radicals can be tricky, but maybe if I can eliminate the square roots by squaring both sides. Let me try that.\n\nFirst, let me write down the equation again to make sure I have it right:\n\n√(x² + 165) - √(x² - 52) = 7.\n\nOkay, so the idea is to isolate one of the radicals and then square both sides. Let me try moving the second radical to the other side:\n\n√(x² + 165) = 7 + √(x² - 52).\n\nNow, if I square both sides, maybe I can get rid of the square roots. Let's do that:\n\n(√(x² + 165))² = (7 + √(x² - 52))².\n\nSimplifying the left side:\n\nx² + 165 = 49 + 14√(x² - 52) + (√(x² - 52))².\n\nThe right side is expanded using the formula (a + b)² = a² + 2ab + b². So the right side

Next, we take the non-reasoning dataset and convert it to conversational format as well.

We have to use Unsloth's `standardize_sharegpt` function to fix up the format of the dataset first.

In [None]:
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(non_reasoning_dataset)

non_reasoning_conversations = tokenizer.apply_chat_template(
    dataset["conversations"],
    tokenize = False,
)

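ShareGPT-style records store each turn as `{"from": ..., "value": ...}`, while the chat template expects `{"role": ..., "content": ...}`. The pure-Python sketch below illustrates that renaming on a single conversation. It is not Unsloth's implementation of `standardize_sharegpt`; the role mapping and the `to_role_content` helper are assumptions made for illustration:

```python
# Assumed mapping from ShareGPT speaker names to chat-template roles.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def to_role_content(conversation):
    """Rename ShareGPT keys ('from', 'value') to chat-template keys ('role', 'content')."""
    return [
        {"role": ROLE_MAP.get(turn["from"], turn["from"]), "content": turn["value"]}
        for turn in conversation
    ]

# Toy conversation, made up for illustration.
sharegpt_turns = [
    {"from": "human", "value": "What is a boolean operator?"},
    {"from": "gpt", "value": "An operator such as AND, OR or NOT."},
]
print(to_role_content(sharegpt_turns))
```

Unknown speaker names pass through unchanged here, so a later template error would point at the offending record.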

Let's see the first row:

In [None]:
non_reasoning_conversations[0]

'<|im_start|>user\nExplain what boolean operators are, what they do, and provide examples of how they can be used in programming. Additionally, describe the concept of operator precedence and provide examples of how it affects the evaluation of boolean expressions. Discuss the difference between short-circuit evaluation and normal evaluation in boolean expressions and demonstrate their usage in code. \n\nFurthermore, add the requirement that the code must be written in a language that does not support short-circuit evaluation natively, forcing the test taker to implement their own logic for short-circuit evaluation.\n\nFinally, delve into the concept of truthiness and falsiness in programming languages, explaining how it affects the evaluation of boolean expressions. Add the constraint that the test taker must write code that handles cases where truthiness and falsiness are implemented differently across different programming languages.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</th

Now let's see how long both datasets are:

In [None]:
print(len(reasoning_conversations))
print(len(non_reasoning_conversations))

19252
100000


The non-reasoning dataset is much longer. Let's assume we want the model to retain some reasoning capabilities, but we specifically want a chat model.

Let's define a ratio of chat-only data. The goal is to define some mixture of both sets of data.

Let's set `chat_percentage` to 0.75, nominally 25% reasoning and 75% chat-based:

In [None]:
chat_percentage = 0.75

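The cell below keeps all 19,252 reasoning conversations and samples the non-reasoning dataset down to `len(reasoning_conversations) * (1 - chat_percentage)` rows, that is, 25% of the reasoning dataset's size (4,813 conversations). The resulting mixture of 24,065 examples is therefore about 80% reasoning and 20% chat, not 25% reasoning and 75% chat.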

In [None]:
import pandas as pd
non_reasoning_subset = pd.Series(non_reasoning_conversations)
non_reasoning_subset = non_reasoning_subset.sample(
    int(len(reasoning_conversations) * (1.0 - chat_percentage)),
    random_state = 2407,
)
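
The sample size in the cell above can be worked out by hand. The sketch below redoes the arithmetic with the dataset length printed earlier, and also shows a hypothetical alternative formula (not used in this notebook) that would make the chat subset account for `chat_percentage` of the final mixture:

```python
n_reasoning = 19_252        # len(reasoning_conversations), as printed earlier
chat_percentage = 0.75

# What the cell above computes: a chat subset sized relative to the reasoning set.
n_chat = int(n_reasoning * (1.0 - chat_percentage))
total = n_reasoning + n_chat
chat_share = n_chat / total
print(n_chat, total, round(chat_share, 3))

# Hypothetical alternative: size the chat subset so it makes up chat_percentage of the mixture.
n_chat_target = int(n_reasoning * chat_percentage / (1.0 - chat_percentage))
target_share = n_chat_target / (n_reasoning + n_chat_target)
print(n_chat_target, round(target_share, 3))
```

The first total matches the 24,065 examples that the trainer reports when it tokenizes the combined dataset.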

Finally combine both datasets:

In [None]:
data = pd.concat([
    pd.Series(reasoning_conversations),
    pd.Series(non_reasoning_subset)
])
data.name = "text"

from datasets import Dataset
combined_dataset = Dataset.from_pandas(pd.DataFrame(data))
combined_dataset = combined_dataset.shuffle(seed = 3407)

<a name="Train"></a>
### **Train the model**
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 30 steps to speed things up, but you can set `num_train_epochs=1` for a full run, and turn off the step limit with `max_steps=None`.

In [None]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = combined_dataset,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 30,
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Use this for WandB etc
    ),
)


In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
7.395 GB of memory reserved.


Let's train the model! To resume a training run, set `trainer.train(resume_from_checkpoint = True)`

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 24,065 | Num Epochs = 1 | Total steps = 30
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 87,293,952/8,000,000,000 (1.09% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,0.5721
2,0.6485
3,0.7873
4,0.6552
5,0.5454
6,0.5361
7,0.5282
8,0.4909
9,0.4432
10,0.544
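
With `per_device_train_batch_size = 2`, `gradient_accumulation_steps = 4` and one GPU, each optimizer step consumes 8 examples, so the 30 steps of this run touch only a small part of the 24,065 training examples. A small arithmetic sketch using the numbers from the training banner above:

```python
per_device_batch = 2
grad_accum = 4
num_gpus = 1
max_steps = 30
num_examples = 24_065   # "Num examples" in the training banner

# Examples consumed per optimizer step and over the whole (shortened) run.
effective_batch = per_device_batch * grad_accum * num_gpus
examples_seen = effective_batch * max_steps
fraction = examples_seen / num_examples

print(f"effective batch size: {effective_batch}")
print(f"examples seen in {max_steps} steps: {examples_seen} ({fraction:.2%} of the data)")
```

This is why the loss values above should be read as a smoke test rather than as a converged fine-tuning run.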


In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

1389.7395 seconds used for training.
23.16 minutes used for training.
Peak reserved memory = 10.135 GB.
Peak reserved memory for training = 2.74 GB.
Peak reserved memory % of max memory = 68.754 %.
Peak reserved memory for training % of max memory = 18.588 %.


## **Suicidal ideation identification experiments**

<a name="Inference"></a>
### **Inference**
According to the `Qwen-3` team, the recommended settings for reasoning inference are `temperature = 0.6, top_p = 0.95, top_k = 20`.

For normal chat-based inference, they are `temperature = 0.7, top_p = 0.8, top_k = 20`.
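
Since the two modes use different sampling settings, it can help to keep them in one place and select them with the same flag that is passed as `enable_thinking`. A minimal sketch; `SAMPLING_PRESETS` and `sampling_kwargs` are hypothetical names, and only the numeric values come from the recommendation above:

```python
# Sampling settings quoted above, keyed by whether thinking mode is enabled.
SAMPLING_PRESETS = {
    True:  {"temperature": 0.6, "top_p": 0.95, "top_k": 20},  # reasoning inference
    False: {"temperature": 0.7, "top_p": 0.8,  "top_k": 20},  # normal chat inference
}

def sampling_kwargs(enable_thinking: bool) -> dict:
    """Return a copy of the preset so callers can modify it safely."""
    return dict(SAMPLING_PRESETS[enable_thinking])

print(sampling_kwargs(True))
print(sampling_kwargs(False))
# Intended use (not executed here):
# model.generate(**inputs, max_new_tokens=1024, **sampling_kwargs(True))
```

Keeping the flag and the preset tied together avoids running non-thinking prompts with the reasoning settings by accident.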

In [None]:
def get_prompt(sentenca, label):
  # Portuguese prompt, in English: 'The sentence "<sentenca>" <label>. Why?'
  return f"A sentença \"{sentenca}\" {label}. Por quê?"

In [None]:
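# Label: "does not contain suicidal ideation". The trailing tab is kept because it is part of the recorded prompt.
prompt = get_prompt("o marcelo eh mt chato vou me matar", "não contém ideação suicida\t")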

messages = [
      {"role" : "user", "content" :  prompt}
  ]
text = tokenizer.apply_chat_template(
  messages,
  tokenize = False,
  add_generation_prompt = True, # Must add for generation
  enable_thinking = True, # Thinking mode on; set to False to disable thinking
)

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

inputs = tokenizer(text, return_tensors = "pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=1024, temperature = 0.6, top_p = 0.95, top_k = 20)
resposta_ids = output[0]
resposta_texto = tokenizer.decode(resposta_ids, skip_special_tokens=True)
resposta_texto

'user\nA sentença "o marcelo eh mt chato vou me matar" não contém ideação suicida\t. Por quê?\nassistant\n<think>\nOkay, the user is asking why the sentence "o marcelo eh mt chato vou me matar" doesn\'t contain suicidal ideation. Let me break this down.\n\nFirst, the sentence is in Portuguese. The user wrote "mt" for "muito," so it\'s "o Marcelo é muito chato vou me matar." Translating that: "Marcelo is really annoying, I\'m going to kill myself." \n\nNow, the user is asking why this doesn\'t indicate suicidal ideation. But wait, the sentence does mention "vou me matar," which is a direct statement about wanting to kill oneself. That\'s a red flag for suicidal ideation. So why is the user asking if it doesn\'t contain it?\n\nMaybe there\'s a misunderstanding. The user might think that because the sentence is in Portuguese, it\'s not considered suicidal ideation in their context. Or perhaps they\'re confused about the translation. Alternatively, maybe the user is testing if I can
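
The decoded string holds the prompt, the `<think>` block and the final answer in one piece, whereas the dataset has separate `answer` and `reasoning` columns. A minimal sketch of separating the two, assuming the tags appear literally in the decoded text as they do in the output above; `split_thinking` is a hypothetical helper:

```python
def split_thinking(decoded: str):
    """Split decoded text into (reasoning, answer) around the <think>...</think> block."""
    _, opened, after_open = decoded.partition("<think>")
    if not opened:
        # No thinking block at all: treat everything as the answer.
        return "", decoded.strip()
    reasoning, closed, answer = after_open.partition("</think>")
    if not closed:
        # Generation stopped (e.g. at max_new_tokens) before the block was closed.
        return reasoning.strip(), ""
    return reasoning.strip(), answer.strip()

# Toy decoded string, made up for illustration.
sample = "user\nquestion\nassistant\n<think>\nstep one\n</think>\n\nfinal answer"
reasoning, answer = split_thinking(sample)
print(repr(reasoning), repr(answer))
```

An empty answer signals a response that ran out of tokens while still reasoning, which is worth counting separately when the results are analysed.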

In [None]:
predicoes = {}
sentencas, targets, labels = get_examples_by_target(10)

predicoes["text"] = sentencas
predicoes["target"] = targets
predicoes["label"] = labels
predicoes["answer"] = []

FastLanguageModel.for_inference(model) # Enable native 2x faster inference (once, before the loop)

for sentenca, label in zip(sentencas, labels):

  prompt = get_prompt(sentenca, label)

  messages = [
      {"role" : "user", "content" :  prompt}
  ]
  text = tokenizer.apply_chat_template(
      messages,
      tokenize = False,
      add_generation_prompt = True, # Must add for generation
      enable_thinking = True, # Thinking mode on; set to False to disable thinking
  )

  inputs = tokenizer(text, return_tensors = "pt").to("cuda")
  output = model.generate(**inputs, max_new_tokens=1024, temperature = 0.6, top_p = 0.95, top_k = 20)
  resposta_ids = output[0]
  # The decoded text contains the prompt followed by the model's response
  resposta_texto = tokenizer.decode(resposta_ids, skip_special_tokens=True)

  predicoes["answer"].append(resposta_texto)

In [None]:

from google.colab import files

df = pd.DataFrame(predicoes)
df.to_excel('respostas_answer.xlsx', index=False)

files.download('respostas_answer.xlsx')
