# Summary of models provided by Unsloth   
https://docs.unsloth.ai/get-started/unsloth-notebooks     
https://huggingface.co/unsloth  




Hugging Face Qwen3 : https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF  


Hugging Face DeepSeekR1 : https://huggingface.co/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF  



Source of this notebook:
https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(14B)-Reasoning-Conversational.ipynb

## Installation

In [1]:
%%capture
# Do this only in Colab notebooks! Otherwise use pip install unsloth
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
!pip install transformers==4.51.3
!pip install --no-deps unsloth

## Unsloth: main recent models available  

In [2]:
from unsloth import FastLanguageModel
import torch

fourbit_models = [                         # Qwen3 family (supports Korean)
    "unsloth/Qwen3-1.7B-unsloth-bnb-4bit", # Qwen 14B 2x faster
    "unsloth/Qwen3-4B-unsloth-bnb-4bit",
    "unsloth/Qwen3-8B-unsloth-bnb-4bit",
    "unsloth/Qwen3-14B-unsloth-bnb-4bit",
    "unsloth/Qwen3-32B-unsloth-bnb-4bit",

    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",  # supports Korean
    "unsloth/Phi-4",
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/orpheus-3b-0.1-ft-unsloth-bnb-4bit" # [NEW] We support TTS models!
] # More models at https://huggingface.co/unsloth

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


## Load the model to use

In [3]:
%%time

from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
#    model_name = "unsloth/Qwen3-14B",
    model_name = "unsloth/Qwen3-14B-unsloth-bnb-4bit",
    max_seq_length = 2048,   # Context length; longer needs more memory
    load_in_4bit = True,     # 4-bit loading reduces memory use
    load_in_8bit = False,    # More accurate, but needs about 2x the memory
    full_finetuning = False, # LoRA will be used instead
    device_map = "auto",
)
# Wall time: 1min 38s

==((====))==  Unsloth 2025.6.2: Fast Qwen3 patching. Transformers: 4.51.3.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors.index.json:   0%|          | 0.00/168k [00:00<?, ?B/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/4.59G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/1.56G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/237 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/707 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

chat_template.jinja:   0%|          | 0.00/4.67k [00:00<?, ?B/s]

CPU times: user 44.2 s, sys: 53.3 s, total: 1min 37s
Wall time: 1min 30s
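
The memory comments on `load_in_4bit` and `load_in_8bit` can be made concrete with a back-of-the-envelope estimate. The sketch below is only a rough lower bound: it assumes a round 14 billion parameters and counts nothing but the weights themselves, so quantization metadata, layers kept in higher precision, activations, and the KV cache are all ignored. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 14e9  # assumption: round figure for a 14B model

for bits in (4, 8, 16):
    size = weight_memory_gb(NUM_PARAMS, bits)
    print(f"{bits:>2}-bit weights: about {size:.0f} GB")
```

The downloaded shards listed above add up to more than the 4-bit figure, which is plausible because the estimate leaves out everything except uniformly quantized weights.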


In [4]:
# Test the model with a simple inference
inputs = tokenizer("안녕, 넌 누구니?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

안녕, 넌 누구니? 
안녕! 나는 Qwen, 알리바바 그룹이 개발한 초대규모 언어 모델이야. 다양한 주제에 대해 대화를 나누고, 질문에 답하고, 창의적인


In [5]:
## Check the model structure ##
model

Qwen3ForCausalLM(
  (model): Qwen3Model(
    (embed_tokens): Embedding(151936, 5120, padding_idx=151654)
    (layers): ModuleList(
      (0-5): 6 x Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (k_proj): Linear4bit(in_features=5120, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=5120, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear4bit(in_features=5120, out_features=17408, bias=False)
          (up_proj): Linear4bit(in_features=5120, out_features=17408, bias=False)
          (down_proj): Linear4bit(in_features=17408, out_features=5120, bias=False)
          (act_fn): SiLU()
        )
   

## Apply LoRA

LoRA adapters are added so that only about 1% to 10% of all parameters are updated.

In [6]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,                     # rank: 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,            # rank or rank*2
    lora_dropout = 0,           # 0 is optimized
    bias = "none",              # "none" is optimized
    use_gradient_checkpointing = "unsloth", # "unsloth" saves memory
    random_state = 3407,
    use_rslora = False,         # Rank-stabilized LoRA, still experimental
    loftq_config = None,        # LoftQ, useful for very small models
)

Unsloth 2025.6.2 patched 40 layers with 40 QKV layers, 40 O layers and 40 MLP layers.
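
As a sanity check on how few parameters LoRA trains, the adapter size can be counted by hand. The sketch below assumes the layer shapes printed in the model structure above, rank 32, and the 40 patched layers reported by Unsloth; each targeted projection gets two bias-free matrices, `lora_A` (in -> rank) and `lora_B` (rank -> out). The helper name `lora_params` is hypothetical.

```python
RANK = 32
NUM_LAYERS = 40

# (in_features, out_features) of each targeted projection, read off the printed model
PROJECTIONS = {
    "q_proj": (5120, 5120),
    "k_proj": (5120, 1024),
    "v_proj": (5120, 1024),
    "o_proj": (5120, 5120),
    "gate_proj": (5120, 17408),
    "up_proj": (5120, 17408),
    "down_proj": (17408, 5120),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    # lora_A holds in_features * rank weights, lora_B holds rank * out_features
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
print(f"per layer: {per_layer:,}  total: {total:,}")
print(f"share of an assumed 14B parameters: {total / 14e9:.2%}")
```

The total agrees with the trainable-parameter count that Unsloth reports when training starts, which sits at the low end of the 1% to 10% range.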


In [7]:
## Check the structure after applying LoRA
model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Qwen3ForCausalLM(
      (model): Qwen3Model(
        (embed_tokens): Embedding(151936, 5120, padding_idx=151654)
        (layers): ModuleList(
          (0-5): 6 x Qwen3DecoderLayer(
            (self_attn): Qwen3Attention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=5120, out_features=5120, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=5120, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=5120, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.

## Data preparation
Qwen3 has both a reasoning mode and a non-reasoning mode, so two datasets are used:
1. The Open Math Reasoning dataset, which was used to win the [AIMO](https://www.kaggle.com/competitions/ai-mathematical-olympiad-progress-prize-2/leaderboard) (AI Mathematical Olympiad - Progress Prize 2) challenge.  
 The subset used here is a 10% sample of the verifiable reasoning traces generated with DeepSeek R1 that scored above 95% accuracy.   

2. [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset, which is in ShareGPT style.   
It is also converted to Hugging Face's standard multi-turn format before use.  

In [8]:
from datasets import load_dataset
reasoning_dataset = load_dataset("unsloth/OpenMathReasoning-mini", split = "cot")
non_reasoning_dataset = load_dataset("mlabonne/FineTome-100k", split = "train")

README.md:   0%|          | 0.00/603 [00:00<?, ?B/s]

data/cot-00000-of-00001.parquet:   0%|          | 0.00/106M [00:00<?, ?B/s]

Generating cot split:   0%|          | 0/19252 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/982 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/117M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/100000 [00:00<?, ? examples/s]

Structure of the two datasets

In [9]:
reasoning_dataset

Dataset({
    features: ['expected_answer', 'problem_type', 'problem_source', 'generation_model', 'pass_rate_72b_tir', 'problem', 'generated_solution', 'inference_mode'],
    num_rows: 19252
})

In [10]:
# Fields to inspect
features = [
    'expected_answer',       # Correct answer
    'problem_type',          # Problem type
    'problem_source',        # Where the problem comes from
    'generation_model',      # Model that generated the solution
    'pass_rate_72b_tir',     # Pass rate of Qwen2.5-Math-72B-Instruct in TIR mode
    'problem',               # The math problem text
    'generated_solution',    # Model solution with chain-of-thought
    'inference_mode'         # Inference mode (e.g. cot)
]

for i in range(1):
    for feature in features:
        print(f"[{feature}]:: {reasoning_dataset[i][feature]}")

[expected_answer]:: 14  
[problem_type]:: has_answer_extracted  
[problem_source]:: aops_c4_high_school_math   
[generation_model]:: DeepSeek-R1  
[pass_rate_72b_tir]:: 0.96875  
[problem]:: Given $\sqrt{x^2+165}-\sqrt{x^2-52}=7$ and $x$ is positive, find all possible values of $x$.  
[generated_solution]::  
 <think>
Okay, let's see. I need to solve the equation √(x² + 165) - √(x² - 52) = 7, and find all positive values of x. Hmm, radicals can be tricky, but maybe if I can eliminate the square roots by squaring both sides. Let me try that.

First, let me write down the equation again to make sure I have it right:

√(x² + 165) - √(x² - 52) = 7.

Okay, so the idea is to isolate one of the radicals and then square both sides. Let me try moving the second radical to the other side:

√(x² + 165) = 7 + √(x² - 52).

Now, if I square both sides, maybe I can get rid of the square roots. Let's do that:

(√(x² + 165))² = (7 + √(x² - 52))².

Simplifying the left side:

x² + 165 = 49 + 14√(x² - 52) + (√(x² - 52))².

The right side is expanded using the formula (a + b)² = a² + 2ab + b². So the right side becomes 7² + 2*7*√(x² - 52) + (√(x² - 52))², which is 49 + 14√(x² - 52) + (x² - 52).

So putting it all together:

x² + 165 = 49 + 14√(x² - 52) + x² - 52.

Hmm, let's simplify the right side. The x² terms will cancel out, right? Let's subtract x² from both sides:

165 = 49 + 14√(x² - 52) - 52.

Simplify the constants on the right:

49 - 52 is -3, so:

165 = -3 + 14√(x² - 52).

Now, add 3 to both sides to isolate the radical term:

165 + 3 = 14√(x² - 52).

So 168 = 14√(x² - 52).

Divide both sides by 14:

168 / 14 = √(x² - 52).

12 = √(x² - 52).

Now, square both sides again to eliminate the square root:

12² = x² - 52.

144 = x² - 52.

Add 52 to both sides:

144 + 52 = x².

196 = x².

So x = √196 = 14.

But wait, since the problem states that x is positive, we only take the positive root. So x = 14.

But hold on, when dealing with squaring equations, sometimes extraneous solutions can come up. I should check if this solution actually satisfies the original equation.

Let's plug x = 14 back into the original equation:

√(14² + 165) - √(14² - 52) = ?

Calculate each term:

14² is 196.

So first radical: √(196 + 165) = √361 = 19.

Second radical: √(196 - 52) = √144 = 12.

So 19 - 12 = 7, which is exactly the right-hand side. So yes, it checks out.

Therefore, the only solution is x = 14. Since the problem says x is positive, we don't have to consider negative roots. So I think that's the answer.
</think>To solve the equation \(\sqrt{x^2 + 165} - \sqrt{x^2 - 52} = 7\) for positive \(x\), we proceed as follows:

1. Start with the given equation:
   \[
   \sqrt{x^2 + 165} - \sqrt{x^2 - 52} = 7
   \]

2. Isolate one of the square roots by moving \(\sqrt{x^2 - 52}\) to the right side:
   \[
   \sqrt{x^2 + 165} = 7 + \sqrt{x^2 - 52}
   \]

3. Square both sides to eliminate the square root on the left:
   \[
   (\sqrt{x^2 + 165})^2 = (7 + \sqrt{x^2 - 52})^2
   \]
   Simplifying both sides, we get:
   \[
   x^2 + 165 = 49 + 14\sqrt{x^2 - 52} + (x^2 - 52)
   \]

4. Combine like terms on the right side:
   \[
   x^2 + 165 = x^2 - 52 + 49 + 14\sqrt{x^2 - 52}
   \]
   Simplifying further:
   \[
   x^2 + 165 = x^2 - 3 + 14\sqrt{x^2 - 52}
   \]

5. Subtract \(x^2\) from both sides:
   \[
   165 = -3 + 14\sqrt{x^2 - 52}
   \]

6. Add 3 to both sides to isolate the term with the square root:
   \[
   168 = 14\sqrt{x^2 - 52}
   \]

7. Divide both sides by 14:
   \[
   12 = \sqrt{x^2 - 52}
   \]

8. Square both sides again to eliminate the square root:
   \[
   12^2 = x^2 - 52
   \]
   Simplifying:
   \[
   144 = x^2 - 52
   \]

9. Add 52 to both sides to solve for \(x^2\):
   \[
   196 = x^2
   \]

10. Take the positive square root (since \(x\) is positive):
    \[
    x = \sqrt{196} = 14
    \]

11. Verify the solution by substituting \(x = 14\) back into the original equation:
    \[
    \sqrt{14^2 + 165} - \sqrt{14^2 - 52} = \sqrt{196 + 165} - \sqrt{196 - 52} = \sqrt{361} - \sqrt{144} = 19 - 12 = 7
    \]
    The solution checks out.

Thus, the only positive solution is:
\[
\boxed{14}
\]
[inference_mode]:: cot

In [11]:
non_reasoning_dataset

Dataset({
    features: ['conversations', 'source', 'score'],
    num_rows: 100000
})

In [12]:
for i in range(1):
    for feature in ['conversations', 'source', 'score']:
        print(f"[{feature}]:: {non_reasoning_dataset[i][str(feature)]}")


[conversations]:: [{'from': 'human', 'value': 'Explain what boolean operators are, what they do, and provide examples of how they can be used in programming. Additionally, describe the concept of operator precedence and provide examples of how it affects the evaluation of boolean expressions. Discuss the difference between short-circuit evaluation and normal evaluation in boolean expressions and demonstrate their usage in code. \n\nFurthermore, add the requirement that the code must be written in a language that does not support short-circuit evaluation natively, forcing the test taker to implement their own logic for short-circuit evaluation.\n\nFinally, delve into the concept of truthiness and falsiness in programming languages, explaining how it affects the evaluation of boolean expressions. Add the constraint that the test taker must write code that handles cases where truthiness and falsiness are implemented differently across different programming languages.'}, {'from': 'gpt', 'v

In [13]:
print(len(non_reasoning_dataset[0]['conversations'][0]['value']))
print(non_reasoning_dataset[0]['conversations'][0]['value'])

927
Explain what boolean operators are, what they do, and provide examples of how they can be used in programming. Additionally, describe the concept of operator precedence and provide examples of how it affects the evaluation of boolean expressions. Discuss the difference between short-circuit evaluation and normal evaluation in boolean expressions and demonstrate their usage in code. 

Furthermore, add the requirement that the code must be written in a language that does not support short-circuit evaluation natively, forcing the test taker to implement their own logic for short-circuit evaluation.

Finally, delve into the concept of truthiness and falsiness in programming languages, explaining how it affects the evaluation of boolean expressions. Add the constraint that the test taker must write code that handles cases where truthiness and falsiness are implemented differently across different programming languages.


### Convert the datasets to the chat template format
#### Convert the reasoning dataset to the chat template format

In [14]:
def generate_conversation(examples):
    problems  = examples["problem"]
    solutions = examples["generated_solution"]
    conversations = []
    for problem, solution in zip(problems, solutions):
        conversations.append([
            {"role" : "user",      "content" : problem},
            {"role" : "assistant", "content" : solution}, ])
    return { "conversations": conversations, }

In [15]:
reasoning_conversations = tokenizer.apply_chat_template(
    reasoning_dataset.map(generate_conversation, batched = True)["conversations"],
    tokenize = False, )

Map:   0%|          | 0/19252 [00:00<?, ? examples/s]

In [16]:
reasoning_conversations[0]

<|im_start|>user\nGiven $\\sqrt{x^2+165}-\\sqrt{x^2-52}=7$ and $x$ is positive, find all possible values of $x$.<|im_end|>\n<|im_start|>assistant\n<think>\nOkay, let's see. I need to solve the equation √(x² + 165) - √(x² - 52) = 7, and find all positive values of x. Hmm, radicals can be tricky, but maybe if I can eliminate the square roots by squaring both sides. Let me try that.\n\nFirst, let me write down the equation again to make sure I have it right:\n\n√(x² + 165) - √(x² - 52) = 7.\n\nOkay, so the idea is to isolate one of the radicals and then square both sides. Let me try moving the second radical to the other side:\n\n√(x² + 165) = 7 + √(x² - 52).\n\nNow, if I square both sides, maybe I can get rid of the square roots. Let's do that:\n\n(√(x² + 165))² = (7 + √(x² - 52))².\n\nSimplifying the left side:\n\nx² + 165 = 49 + 14√(x² - 52) + (√(x² - 52))².\n\nThe right side is expanded using the formula (a + b)² = a² + 2ab + b². So the right side becomes 7² + 2*7*√(x² - 52) + (√(x² - 52))², which is 49 + 14√(x² - 52) + (x² - 52).\n\nSo putting it all together:\n\nx² + 165 = 49 + 14√(x² - 52) + x² - 52.\n\nHmm, let's simplify the right side. The x² terms will cancel out, right? Let's subtract x² from both sides:\n\n165 = 49 + 14√(x² - 52) - 52.\n\nSimplify the constants on the right:\n\n49 - 52 is -3, so:\n\n165 = -3 + 14√(x² - 52).\n\nNow, add 3 to both sides to isolate the radical term:\n\n165 + 3 = 14√(x² - 52).\n\nSo 168 = 14√(x² - 52).\n\nDivide both sides by 14:\n\n168 / 14 = √(x² - 52).\n\n12 = √(x² - 52).\n\nNow, square both sides again to eliminate the square root:\n\n12² = x² - 52.\n\n144 = x² - 52.\n\nAdd 52 to both sides:\n\n144 + 52 = x².\n\n196 = x².\n\nSo x = √196 = 14.\n\nBut wait, since the problem states that x is positive, we only take the positive root. So x = 14.\n\nBut hold on, when dealing with squaring equations, sometimes extraneous solutions can come up. 
I should check if this solution actually satisfies the original equation.\n\nLet's plug x = 14 back into the original equation:\n\n√(14² + 165) - √(14² - 52) = ?\n\nCalculate each term:\n\n14² is 196.\n\nSo first radical: √(196 + 165) = √361 = 19.\n\nSecond radical: √(196 - 52) = √144 = 12.\n\nSo 19 - 12 = 7, which is exactly the right-hand side. So yes, it checks out.\n\nTherefore, the only solution is x = 14. Since the problem says x is positive, we don't have to consider negative roots. So I think that's the answer.\n</think>\n\nTo solve the equation \\(\\sqrt{x^2 + 165} - \\sqrt{x^2 - 52} = 7\\) for positive \\(x\\), we proceed as follows:\n\n1. Start with the given equation:\n   \\[\n   \\sqrt{x^2 + 165} - \\sqrt{x^2 - 52} = 7\n   \\]\n\n2. Isolate one of the square roots by moving \\(\\sqrt{x^2 - 52}\\) to the right side:\n   \\[\n   \\sqrt{x^2 + 165} = 7 + \\sqrt{x^2 - 52}\n   \\]\n\n3. Square both sides to eliminate the square root on the left:\n   \\[\n   (\\sqrt{x^2 + 165})^2 = (7 + \\sqrt{x^2 - 52})^2\n   \\]\n   Simplifying both sides, we get:\n   \\[\n   x^2 + 165 = 49 + 14\\sqrt{x^2 - 52} + (x^2 - 52)\n   \\]\n\n4. Combine like terms on the right side:\n   \\[\n   x^2 + 165 = x^2 - 52 + 49 + 14\\sqrt{x^2 - 52}\n   \\]\n   Simplifying further:\n   \\[\n   x^2 + 165 = x^2 - 3 + 14\\sqrt{x^2 - 52}\n   \\]\n\n5. Subtract \\(x^2\\) from both sides:\n   \\[\n   165 = -3 + 14\\sqrt{x^2 - 52}\n   \\]\n\n6. Add 3 to both sides to isolate the term with the square root:\n   \\[\n   168 = 14\\sqrt{x^2 - 52}\n   \\]\n\n7. Divide both sides by 14:\n   \\[\n   12 = \\sqrt{x^2 - 52}\n   \\]\n\n8. Square both sides again to eliminate the square root:\n   \\[\n   12^2 = x^2 - 52\n   \\]\n   Simplifying:\n   \\[\n   144 = x^2 - 52\n   \\]\n\n9. Add 52 to both sides to solve for \\(x^2\\):\n   \\[\n   196 = x^2\n   \\]\n\n10. Take the positive square root (since \\(x\\) is positive):\n    \\[\n    x = \\sqrt{196} = 14\n    \\]\n\n11. 
Verify the solution by substituting \\(x = 14\\) back into the original equation:\n    \\[\n    \\sqrt{14^2 + 165} - \\sqrt{14^2 - 52} = \\sqrt{196 + 165} - \\sqrt{196 - 52} = \\sqrt{361} - \\sqrt{144} = 19 - 12 = 7\n    \\]\n    The solution checks out.\n\nThus, the only positive solution is:\n\\[\n\\boxed{14}\n\\]<|im_end|>\n
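
The `<|im_start|>` / `<|im_end|>` markers in the output above follow the ChatML layout. The following is a simplified, pure-Python sketch of that layout, not the tokenizer's real template: the hypothetical helper `render_chatml` only wraps each message in the markers, whereas the actual `chat_template.jinja` shipped with the model also handles details such as `<think>` blocks.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Simplified ChatML rendering: one <|im_start|>role ... <|im_end|> block per message."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4"},
]
print(render_chatml(example))
```

For training data the full conversation is rendered, as in the output above; for inference only the user turn is rendered and the assistant turn is left open.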

Next, take the non-reasoning dataset and convert it to the chat format as well.    
Unsloth's standardize_sharegpt function must be applied first to fix the format of the dataset.

In [17]:
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(non_reasoning_dataset)

non_reasoning_conversations = tokenizer.apply_chat_template(
    dataset["conversations"],
    tokenize = False,
)

Unsloth: Standardizing formats (num_proc=12):   0%|          | 0/100000 [00:00<?, ? examples/s]
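
To show what this format change amounts to, here is a rough pure-Python sketch of mapping ShareGPT-style turns (`from` / `value`) to role-based messages (`role` / `content`). It is not Unsloth's implementation of `standardize_sharegpt`; the helper `sharegpt_to_messages` and the `ROLE_MAP` table are assumptions made for illustration.

```python
# Assumed mapping from ShareGPT speaker names to chat roles (illustrative only)
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}


def sharegpt_to_messages(conversation):
    """Convert [{'from': ..., 'value': ...}] turns to [{'role': ..., 'content': ...}]."""
    return [
        {"role": ROLE_MAP.get(turn["from"], turn["from"]), "content": turn["value"]}
        for turn in conversation
    ]


sample = [
    {"from": "human", "value": "Explain what boolean operators are."},
    {"from": "gpt", "value": "Boolean operators are logical operators."},
]
for message in sharegpt_to_messages(sample):
    print(message)
```

Messages in this role-based form are what `tokenizer.apply_chat_template` expects as input.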

In [18]:
non_reasoning_conversations[0]

<|im_start|>user\nExplain what boolean operators are, what they do, and provide examples of how they can be used in programming. Additionally, describe the concept of operator precedence and provide examples of how it affects the evaluation of boolean expressions. Discuss the difference between short-circuit evaluation and normal evaluation in boolean expressions and demonstrate their usage in code. \n\nFurthermore, add the requirement that the code must be written in a language that does not support short-circuit evaluation natively, forcing the test taker to implement their own logic for short-circuit evaluation.\n\nFinally, delve into the concept of truthiness and falsiness in programming languages, explaining how it affects the evaluation of boolean expressions. Add the constraint that the test taker must write code that handles cases where truthiness and falsiness are implemented differently across different programming languages.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\nBoolean operators are logical operators used in programming to manipulate boolean values. They operate on one or more boolean operands and return a boolean result. The three main boolean operators are "AND" (&&), "OR" (||), and "NOT" (!).\n\nThe "AND" operator returns true if both of its operands are true, and false otherwise. For example:\n\n```python\nx = 5\ny = 10\nresult = (x > 0) and (y < 20)  # This expression evaluates to True\n```\n\nThe "OR" operator returns true if at least one of its operands is true, and false otherwise. For example:\n\n```python\nx = 5\ny = 10\nresult = (x > 0) or (y < 20)  # This expression evaluates to True\n```\n\nThe "NOT" operator negates the boolean value of its operand. It returns true if the operand is false, and false if the operand is true. For example:\n\n```python\nx = 5\nresult = not (x > 10)  # This expression evaluates to True\n```\n\nOperator precedence refers to the order in which operators are evaluated in an expression. 
It ensures that expressions are evaluated correctly. In most programming languages, logical AND has higher precedence than logical OR. For example:\n\n```python\nresult = True or False and False  # This expression is evaluated as (True or (False and False)), which is True\n```\n\nShort-circuit evaluation is a behavior where the second operand of a logical operator is not evaluated if the result can be determined based on the value of the first operand. In short-circuit evaluation, if the first operand of an "AND" operator is false, the second operand is not evaluated because the result will always be false. Similarly, if the first operand of an "OR" operator is true, the second operand is not evaluated because the result will always be true.\n\nIn programming languages that support short-circuit evaluation natively, you can use it to improve performance or avoid errors. For example:\n\n```python\nif x != 0 and (y / x) > 10:\n    # Perform some operation\n```\n\nIn languages without native short-circuit evaluation, you can implement your own logic to achieve the same behavior. Here\'s an example in pseudocode:\n\n```\nif x != 0 {\n    if (y / x) > 10 {\n        // Perform some operation\n    }\n}\n```\n\nTruthiness and falsiness refer to how non-boolean values are evaluated in boolean contexts. In many programming languages, non-zero numbers and non-empty strings are considered truthy, while zero, empty strings, and null/None values are considered falsy.\n\nWhen evaluating boolean expressions, truthiness and falsiness come into play. For example:\n\n```python\nx = 5\nresult = x  # The value of x is truthy, so result is also truthy\n```\n\nTo handle cases where truthiness and falsiness are implemented differently across programming languages, you can explicitly check the desired condition. 
For example:\n\n```python\nx = 5\nresult = bool(x)  # Explicitly converting x to a boolean value\n```\n\nThis ensures that the result is always a boolean value, regardless of the language\'s truthiness and falsiness rules.<|im_end|>\n

Check the dataset lengths  

In [19]:
print(len(reasoning_conversations))
print(len(non_reasoning_conversations))

19252
100000


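The non-reasoning dataset is much larger.   
Assume we want the model to keep some reasoning ability but mainly serve as a chat model,  
so we need to define the proportion of chat-only data.   
The goal is to define a mix of the two datasets.   
Here we try 75% reasoning and 25% chat.  

### Adjust the reasoning/non-reasoning ratio
The reasoning dataset is kept whole and makes up 75% (100% - chat_percentage) of the mix; the non-reasoning dataset is sampled down to fill the remaining 25%.  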

In [20]:
import pandas as pd
chat_percentage = 0.25
non_reasoning_subset = pd.Series(non_reasoning_conversations)
non_reasoning_subset = non_reasoning_subset.sample(
    int(len(reasoning_conversations)*(chat_percentage/(1 - chat_percentage))),
    random_state = 2407, )
print(len(reasoning_conversations))
print(len(non_reasoning_subset))
print(len(non_reasoning_subset) / (len(non_reasoning_subset) +
                                   len(reasoning_conversations)))

19252
6417
0.2499902606256574
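
The sample size used above comes from solving n_chat / (n_chat + n_reasoning) = chat_percentage for n_chat, which gives n_chat = n_reasoning * chat_percentage / (1 - chat_percentage). A minimal sketch of that arithmetic, with a hypothetical helper name `chat_subset_size`:

```python
def chat_subset_size(n_reasoning: int, chat_fraction: float) -> int:
    """Number of chat examples so that they form chat_fraction of the combined set."""
    if not 0 <= chat_fraction < 1:
        raise ValueError("chat_fraction must be in [0, 1)")
    # Truncate toward zero, matching the int(...) call in the cell above
    return int(n_reasoning * (chat_fraction / (1 - chat_fraction)))


n_reasoning = 19252
n_chat = chat_subset_size(n_reasoning, 0.25)
print(n_chat, n_chat / (n_chat + n_reasoning))
```

Because of the truncation, the resulting share is slightly below the requested 25%.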


### Combine the two datasets

In [21]:
data = pd.concat([
    pd.Series(reasoning_conversations),
    pd.Series(non_reasoning_subset)
])
data.name = "text"

from datasets import Dataset
combined_dataset = Dataset.from_pandas(pd.DataFrame(data))
combined_dataset = combined_dataset.shuffle(seed = 3407)

## Train the model
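Training uses the SFTTrainer from Hugging Face TRL.   
To keep the run short, only 30 steps are performed.  
For a full run, set num_train_epochs=1 and max_steps=None.   

SFTTrainer: built for supervised fine-tuning of generative LLMs (a common first stage before RLHF)  
Trainer: the general-purpose Transformers trainer, often used for BERT-style models

### Create the trainer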

In [22]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = combined_dataset,
    eval_dataset = None,                 # Set if evaluation is needed
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4, # Compensates for the small per-device batch
        warmup_steps = 5,
        # num_train_epochs = 1,          # Full training run
        max_steps = 30,                  # Short run for a quick demo
        learning_rate = 2e-4,            # Lower toward 2e-5 for slower, longer training
        logging_steps = 1,
        optim = "adamw_8bit",            # 8-bit AdamW (pairs with weight decay)
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none",
    ), )

Unsloth: Tokenizing ["text"] (num_proc=12):   0%|          | 0/25669 [00:00<?, ? examples/s]

In [23]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
# GPU = NVIDIA L4. Max memory = 22.161 GB.
# 10.896 GB of memory reserved.

GPU = NVIDIA L4. Max memory = 22.161 GB.
10.998 GB of memory reserved.


Let's train the model! To resume a training run, set `trainer.train(resume_from_checkpoint = True)`

### Training

In [24]:
%%time
trainer_stats = trainer.train()
# Wall time: 16min 34s

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 25,669 | Num Epochs = 1 | Total steps = 30
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 128,450,560/14,000,000,000 (0.92% trained)


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|------|---------------|
| 1 | 0.5506 |
| 2 | 0.5403 |
| 3 | 0.519 |
| 4 | 0.6972 |
| 5 | 0.6226 |
| 6 | 0.5596 |
| 7 | 0.5609 |
| 8 | 0.4369 |
| 9 | 0.4807 |
| 10 | 0.426 |


CPU times: user 9min 44s, sys: 6min 27s, total: 16min 12s
Wall time: 16min 17s
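
To put the 30-step run in perspective, the numbers in the training banner can be turned into a coverage estimate. This is an approximate sketch: the helper `training_coverage` is hypothetical, and the steps-per-epoch figure uses a simple ceiling, so the trainer's own rounding may differ by a step.

```python
import math


def training_coverage(num_examples, per_device_batch, grad_accum, num_gpus, max_steps):
    """Return (effective batch size, examples seen, approximate steps for one epoch)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    examples_seen = effective_batch * max_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, examples_seen, steps_per_epoch


effective, seen, full_epoch = training_coverage(25669, 2, 4, 1, 30)
print(f"effective batch size: {effective}")
print(f"examples seen in 30 steps: {seen} of 25669 ({seen / 25669:.1%})")
print(f"steps for one full epoch: about {full_epoch}")
```

The demo run therefore touches only a small fraction of the combined dataset; a full epoch would take roughly a hundred times as many steps.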


In [25]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

# 992.1148 seconds used for training.
# 16.54 minutes used for training.
# Peak reserved memory = 13.883 GB.
# Peak reserved memory for training = 2.987 GB.
# Peak reserved memory % of max memory = 62.646 %.
# Peak reserved memory for training % of max memory = 13.479 %.

975.4006 seconds used for training.
16.26 minutes used for training.
Peak reserved memory = 13.887 GB.
Peak reserved memory for training = 2.889 GB.
Peak reserved memory % of max memory = 62.664 %.
Peak reserved memory for training % of max memory = 13.036 %.
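The derived figures above can be checked by hand. The sketch below repeats the cell's arithmetic on the printed values, without needing a GPU; `memory_report` is a hypothetical helper name.

```python
def memory_report(peak_gb, start_gb, max_gb):
    """Repeat the arithmetic of the stats cell on already-measured values (in GB)."""
    training_gb = round(peak_gb - start_gb, 3)
    return {
        "training_gb": training_gb,
        "peak_pct": round(peak_gb / max_gb * 100, 3),
        "training_pct": round(training_gb / max_gb * 100, 3),
    }


# Values printed by this run: peak 13.887 GB, 10.998 GB before training, 22.161 GB total
report = memory_report(peak_gb = 13.887, start_gb = 10.998, max_gb = 22.161)
print(report)
```

The difference between the peak and the pre-training reservation (about 2.9 GB here) is the memory attributable to training itself; the rest was already reserved after loading the model.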


## Inference
Recommended Qwen3 settings for reasoning (thinking enabled)   
* temperature = 0.6, top_p = 0.95, top_k = 20  

Recommended settings for general chat (thinking disabled)  
* temperature = 0.7, top_p = 0.8, top_k = 20
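To keep the two presets from drifting apart across cells, they can live in one place. This is a minimal sketch; `QWEN3_SAMPLING` and `sampling_kwargs` are hypothetical names, and the values are the ones listed above.

```python
# Sampling presets from the list above, keyed by whether thinking is enabled
QWEN3_SAMPLING = {
    True:  {"temperature": 0.6, "top_p": 0.95, "top_k": 20},  # thinking enabled
    False: {"temperature": 0.7, "top_p": 0.8,  "top_k": 20},  # thinking disabled
}


def sampling_kwargs(enable_thinking):
    """Return a copy of the preset so callers can modify it safely."""
    return dict(QWEN3_SAMPLING[bool(enable_thinking)])


print(sampling_kwargs(False))

# Usage in a generation cell (commented out: needs the loaded model and inputs):
# _ = model.generate(**inputs, max_new_tokens = 256, **sampling_kwargs(False))
```

The same flag can then be passed to both `apply_chat_template(enable_thinking = ...)` and `sampling_kwargs(...)`, so the template and the sampling settings always agree.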

### Disable thinking
enable_thinking = False

In [26]:
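# Korean prompt: "Explain in detail the photolithography step among the eight major semiconductor processes."
query = "반도체 8대 공정 중에서 포토리소그래피(Photolithography) 단계에 대해서 자세히 설명해 줘."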
steps = [query]
prompt = "\n".join(steps)
prompt

'반도체 8대 공정 중에서 포토리소그래피(Photolithography) 단계에 대해서 자세히 설명해 줘.'

In [27]:
query = "반도체 8대 공정 중에서 포토리소그래피(Photolithography) 단계에 대해서 자세히 설명해 줘."
steps = [query]
prompt = "\n".join(steps)

messages = [
#    {"role" : "user", "content" : "이식을 풀어줘. (x + 2)^2 = 0."} ]
    {"role" : "user", "content" : prompt} ]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
    enable_thinking = False, ) # Disable thinking

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(
        text,
        return_tensors = "pt").to("cuda"),
    max_new_tokens = 256, # Increase for longer outputs!
    temperature = 0.7, top_p = 0.8, top_k = 20, # For non thinking
    streamer = TextStreamer(tokenizer, skip_prompt = True), )

반도체 8대 공정 중에서 포토리소그래피(Photolithography) 단계는 반도체 제조 과정에서 가장 중요한 단계 중 하나로, 회로 패턴을 기판에 전달하는 데 사용됩니다. 이 과정은 반도체 기판 위에 회로 패턴을 형성하여 전자기기의 기능을 결정하는 데 중요한 역할을 합니다.

포토리소그래피의 주요 단계는 다음과 같습니다:

1. **기판 준비**: 반도체 기판 위에 포토리소그래피를 위한 준비 작업이 이루어집니다. 이 과정에서는 기판의 표면을 깨끗하게 하고, 포토리소그래피에 필요한 층을 형성합니다.

2. **광감광막 적용**: 기판 위에 광감광막(Photoresist)이 적용됩니다. 광감광막은 빛에 반응하여 패턴을 형성하는 특성을 가진 층입니다.

3. **광원 조사**: 광감광막이 적용된 기판에 광원을


### Enable thinking
enable_thinking = True

In [28]:
messages = [
    # Korean prompt: "Solve this equation. (x + 2)^2 = 0."
    {"role" : "user", "content" : "이식을 풀어줘. (x + 2)^2 = 0."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
    enable_thinking = True, # Enable thinking
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 1024,   # Increase for longer outputs!
    temperature = 0.6, top_p = 0.95, top_k = 20, # For thinking
    streamer = TextStreamer(tokenizer, skip_prompt = True), )

<think>

Okay, let's see. I need to solve the equation (x + 2)^2 = 0. Hmm, so the equation is a square of a binomial equal to zero. I remember that when you have something squared equal to zero, the only solution is when the inside of the square is zero. Because if you square any real number, it's either positive or zero. So, if the square is zero, the number inside must be zero. 

Wait, let me think again. Let's take it step by step. The equation is (x + 2)^2 = 0. To solve for x, I need to get rid of the square. How do I do that? Well, taking the square root of both sides. So, if I take the square root of both sides, the left side becomes x + 2, and the right side becomes sqrt(0). But sqrt(0) is 0. So, x + 2 = 0. Then, solving for x, I subtract 2 from both sides. That gives x = -2. 

But wait, when I take the square root, I should remember that the square root of a square can be positive or negative. But in this case, since the right side is 0, which is neither positive nor negative, 

<think>

Okay, so I need to solve the equation (x + 2)^2 = 0. Hmm, let's see. First, I remember that when you have something squared equals zero, the only solution is when the inside of the square is zero. Because any real number squared is non-negative, and the only way it can be zero is if the number itself is zero. So, applying that here, (x + 2)^2 = 0 means that x + 2 must be zero. So then, solving for x, I just subtract 2 from both sides. That would give x = -2. Wait, but since it's squared, does that mean there are two solutions? No, wait, if it's squared and equals zero, there's only one solution because both roots are the same. So x = -2 is the only solution here. Let me check that again. If I plug x = -2 back into the equation, (-2 + 2)^2 = 0^2 = 0, which matches the original equation. So yeah, that's correct. There's only one solution, which is x = -2.
</think>

(x + 2)^2 = 0의 해를 구하기 위해 다음과 같은 단계를 따릅니다:

1. 양변의 제곱근을 취합니다:
   \[
   \sqrt{(x + 2)^2} = \sqrt{0}
   \]
   이는 \( |x + 2| = 0 \)와 동일합니다.

2. 절대값의 정의에 따라 \( x + 2 = 0 \)을 얻습니다.

3. 방정식을 풀어 x를 구합니다:
   \[
   x = -2
   \]

따라서, 주어진 방정식의 해는 \( x = -2 \)입니다.<|im_end|>
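With thinking enabled, the completion contains the reasoning between `<think>` and `</think>`, followed by the final answer. When only the answer is needed, the decoded text can be split on those tags. The sketch below assumes the output has already been decoded to a string (the cells above stream it instead), and `split_thinking` is a hypothetical helper name.

```python
from typing import Tuple


def split_thinking(generated: str) -> Tuple[str, str]:
    """Split a decoded completion into (reasoning, answer)."""
    text = generated.replace("<|im_end|>", "").strip()
    open_tag, close_tag = "<think>", "</think>"
    if open_tag not in text:
        return "", text                       # no reasoning block at all
    after_open = text.split(open_tag, 1)[1]
    if close_tag not in after_open:
        return after_open.strip(), ""         # cut off before the answer started
    reasoning, answer = after_open.split(close_tag, 1)
    return reasoning.strip(), answer.strip()


sample = "<think>\nx + 2 must be zero.\n</think>\n\nx = -2<|im_end|>"
reasoning, answer = split_thinking(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

The truncated case matters in practice: if `max_new_tokens` runs out inside the reasoning block, there is no `</think>` and no answer, which is a signal to raise the token limit.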

In [32]:
messages = [
    # Korean prompt: "The impact of Japan's acquisition of US Steel on the Korean economy"
    {"role" : "user", "content" : "일본의 US 스틸 인수가 한국 경제에 미치는 영향"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
    enable_thinking = True, # Enable thinking
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 1024,   # Increase for longer outputs!
    temperature = 0.6, top_p = 0.95, top_k = 20, # For thinking
    streamer = TextStreamer(tokenizer, skip_prompt = True), )

<think>
</think>

일본의 US 스틸 인수가 한국 경제에 미치는 영향은 여러 측면에서 고려해야 한다. 이 인수는 단순히 기업 간의 인수합병(M&A)을 넘어서 글로벌 산업 구조 조정과 경제적 파급 효과를 포함한다. 다음은 주요 영향을 분석한 내용이다.

### 1. **철강 산업 구조 조정**
- **시장 경쟁 구도 변화**: US 스틸의 일본 진출은 글로벌 철강 시장에서의 경쟁 구도를 재편할 수 있다. 한국 철강 업체들은 일본 기업과의 경쟁에서 기술력과 가격 경쟁력을 강화해야 한다.
- **생산 설비 효율성**: US 스틸은 일본에 대규모 철강 생산 설비를 확장할 가능성이 있다. 이는 일본 내부의 철강 수요를 충족시키는 동시에, 한국과의 무역 관계에 영향을 줄 수 있다.

### 2. **무역 및 투자 관계**
- **한국-일본 무역**: 한국은 일본에 철강 제품을 수출하는 주요 국가 중 하나이다. US 스틸의 일본 진출은 한국 철강 수출에 경쟁을 유발할 수 있다.
- **투자 유치**: US 스틸의 일본 진출은 한국에 대한 투자 유치를 줄일 수 있다. 그러나 반대로, 한국 철강 업체들이 일본 시장에 대한 관심을 높일 수 있다.

### 3. **기술 및 혁신**
- **기술 이전**: US 스틸은 일본에 최신 철강 제조 기술과 생산 설비를 도입할 가능성이 있다. 이는 한국 철강 업체들이 기술 혁신을 통해 경쟁력을 유지하거나 강화해야 한다는 압박을 주는 요인이다.
- **R&D 투자**: 한국 철강 업체들은 기술 혁신을 위해 R&D 투자를 확대해야 할 필요성이 있다.

### 4. **노동 시장 및 고용**
- **고용 변화**: US 스틸의 일본 진출은 일본 내 철강 업계의 고용 구조를 변화시킬 수 있다. 이는 한국 철강 업계의 고용에 간접적인 영향을 줄 수 있다.
- **노동자 재교육**: 한국 철강 업체들은 기술 변화에 대응하기 위해 노동자들의 재교육을 진행해야 할 필요성이 있다.

### 5. **환경 및 지속 가능성**
- **친환경 기술

## Saving, loading finetuned models


To save the model as LoRA adapters, use `save_pretrained`.  
[Note] This saves only the LoRA adapters, not the full model.  
For saving to 16-bit or GGUF, see below.    

### Save only the LoRA adapters

In [29]:
model.save_pretrained("lora_model")      # Saves only the LoRA adapter parameters
tokenizer.save_pretrained("lora_model")  # Saves the tokenizer

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/vocab.json',
 'lora_model/merges.txt',
 'lora_model/added_tokens.json',
 'lora_model/tokenizer.json')

### Loading the LoRA adapters   
**This fails if GPU memory is insufficient.** Restart the session, then run this cell.  

In [31]:
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model", # The base model is loaded automatically
    max_seq_length = 2048,
    load_in_4bit = True, )

==((====))==  Unsloth 2025.6.2: Fast Qwen3 patching. Transformers: 4.51.3.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `llm_int8_enable_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details. 
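The error above is consistent with the first copy of the model still occupying the GPU when the second load starts. A rough budget check makes this visible. The sketch below is an estimate under two assumptions: that the session still holds about the peak reported after training (13.887 GB), and that a fresh load needs about what the first load reserved (10.998 GB). `fits_on_gpu` is a hypothetical helper name.

```python
def fits_on_gpu(max_gb, already_reserved_gb, needed_gb):
    """Rough check: does a new load fit next to what the session already holds?"""
    return already_reserved_gb + needed_gb <= max_gb


MAX_GB = 22.161        # NVIDIA L4, as reported by the notebook
NEEDED_GB = 10.998     # assumption: a reload reserves about as much as the first load

# Same session, first model still in memory (assumed to hold the reported peak)
print(fits_on_gpu(MAX_GB, already_reserved_gb = 13.887, needed_gb = NEEDED_GB))

# After a session restart, nothing is reserved yet
print(fits_on_gpu(MAX_GB, already_reserved_gb = 0.0, needed_gb = NEEDED_GB))
```

Under these assumptions the two copies together would need roughly 24.9 GB, more than the card has, which is why restarting the session before reloading is the suggested fix.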

### Saving to float16 for vLLM

Saving in float16 is supported (saves the full model).   
* For float16, select `merged_16bit`
* For int4, select `merged_4bit`  

If merging the model fails  
* Save the LoRA adapters instead  
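The fallback described above can be written as a small wrapper: try the merged save, and if it raises, save only the adapters. This is a sketch; `save_merged_or_adapters` and the `_lora` directory suffix are hypothetical, while `save_pretrained_merged` and `save_pretrained` are the methods used elsewhere in this notebook.

```python
def save_merged_or_adapters(model, tokenizer, out_dir, save_method = "merged_16bit"):
    """Try a merged save; on failure, fall back to saving only the LoRA adapters."""
    try:
        model.save_pretrained_merged(out_dir, tokenizer, save_method = save_method)
        return "merged"
    except Exception as err:  # e.g. out-of-memory or out-of-disk during the merge
        print(f"Merge failed ({err}); saving LoRA adapters instead.")
        adapter_dir = out_dir + "_lora"   # hypothetical naming convention
        model.save_pretrained(adapter_dir)
        tokenizer.save_pretrained(adapter_dir)
        return "lora"


# Usage (commented out: needs the finetuned model and tokenizer):
# result = save_merged_or_adapters(model, tokenizer, "model_m16")
```

Because the adapters are small, the fallback keeps the training result safe even when the full merge cannot complete; the merge can then be retried later in a fresh session.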

In [None]:
# Merge to 16-bit (float16); the model is large (5GB), so this takes a long time
model.save_pretrained_merged("model_m16", tokenizer, save_method = "merged_16bit",)
# Merge to 4-bit (int4); the model is large, so this takes a long time
model.save_pretrained_merged("model_m4", tokenizer, save_method = "merged_4bit",)

# Just LoRA adapters
model.save_pretrained_merged("model_lora", tokenizer, save_method = "lora",)

### GGUF / llama.cpp Conversion
Save in GGUF (llama.cpp) format  
* Saves as `q8_0` by default    
* Other methods such as `q4_k_m` are also allowed   
* Use `save_pretrained_gguf()`   

Supported quantization methods (for the full list, see the [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):

* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, and Q4_K for the rest.   
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, and Q5_K for the rest.   
>_m (mixed): attention.wv and feed_forward.w2 are relatively sensitive, so they are kept at higher precision.  

In [None]:
# Save to 8bit Q8_0
model.save_pretrained_gguf("model_q8_gguf", tokenizer,)
# Save to q4_k_m GGUF
model.save_pretrained_gguf("model_q4_k_m_gguf", tokenizer, quantization_method = "q4_k_m")

# Save to 16bit GGUF
# model.save_pretrained_gguf("model_f16_gguf", tokenizer, quantization_method = "f16")
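When several quantizations are wanted, the calls above can be wrapped in a loop that derives one output directory per method. This is a sketch; `export_gguf` and the directory naming are hypothetical, and it calls `save_pretrained_gguf` with the same arguments as the cell above.

```python
def export_gguf(model, tokenizer, methods = ("q8_0", "q4_k_m"), prefix = "model"):
    """Export one GGUF per quantization method; return the output directories."""
    out_dirs = []
    for method in methods:
        out_dir = f"{prefix}_{method}_gguf"   # e.g. model_q4_k_m_gguf
        model.save_pretrained_gguf(out_dir, tokenizer, quantization_method = method)
        out_dirs.append(out_dir)
    return out_dirs


# Usage (commented out: needs the finetuned model and tokenizer):
# export_gguf(model, tokenizer, methods = ("q8_0", "q4_k_m", "q5_k_m"))
```

Each export is slow for a model of this size, so listing only the methods that will actually be deployed keeps the run time and disk use down.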