# About
This notebook fine-tunes a GPT-style conversational language model.

It follows [this](https://colab.research.google.com/drive/15OyFkGoCImV9dSsewU1wa2JuKB4-mDE_?usp=sharing) notebook as a guide,
then uses [this](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing) notebook to push the fine-tuned model to Ollama.

## To Do
- [ ] Make this a script?

In [1]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.11.10: Fast Llama patching. Transformers:4.46.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
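
The `load_in_4bit = True` setting matters on this GPU. As a rough back-of-the-envelope sketch (the parameter count of about 8.03 billion is an assumption taken from the public model card, not from this notebook), the 16-bit weights alone would not fit in the 14.748 GB reported for the Tesla T4, while 4-bit weights would. Real 4-bit checkpoints are somewhat larger than this estimate, because some layers stay in higher precision and the quantization scales need storage too.

```python
# Rough weight-memory estimate; ignores activations, optimizer state and
# quantization overhead, so treat the numbers as lower bounds.
PARAMS = 8.03e9   # approximate Llama 3.1 8B parameter count (assumption)
GPU_GB = 14.748   # "Max memory" reported by Unsloth for the Tesla T4


def weight_gb(num_params: float, bits_per_param: int) -> float:
    """Size of the weights in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


fp16_gb = weight_gb(PARAMS, 16)
int4_gb = weight_gb(PARAMS, 4)

print(f"16-bit weights: {fp16_gb:.2f} GB (fits on T4: {fp16_gb < GPU_GB})")
print(f" 4-bit weights: {int4_gb:.2f} GB (fits on T4: {int4_gb < GPU_GB})")
```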



In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.11.10 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
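
The trainer later reports 41,943,040 trainable parameters. That figure can be reproduced by hand: a LoRA adapter of rank `r` on a linear layer with `in` inputs and `out` outputs adds `r * (in + out)` parameters. The sketch below assumes the usual Llama 3.1 8B layer shapes (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128); these come from the public model config, not from this notebook.

```python
# Assumed Llama 3.1 8B shapes (from the public config, not from this notebook).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128   # 8 key/value heads x 128 head dimension
NUM_LAYERS = 32
RANK = 16          # r = 16, as passed to get_peft_model

# (in_features, out_features) for each module listed in target_modules
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in shapes.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 41,943,040
```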


In [4]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

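def formatting_prompts_func(examples):
    # Render each conversation (a list of {"role", "content"} dicts) into a
    # single training string using the tokenizer's Llama 3.1 chat template.
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }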

from datasets import load_dataset
dataset = load_dataset("brianmatzelle/political-subreddit-threads-643k", split = "train")


In [5]:
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)


In [6]:
dataset[5]["conversations"]

[{'content': 'You are a redditor, having a conversation with another redditor.',
  'role': 'system'},
 {'content': 'Rates of cancer caused by smoking hit record highs',
  'role': 'user'},
 {'content': 'Is this the Gen X/Millenial cancer bump? Consequences arriving with age?\n\nI just hope I quit in time.',
  'role': 'assistant'},
 {'content': 'I started at 15. For a girl, of course. Quit after college. Started again on my honeymoon. Quit after my kid. Started again after divorce. Quit again when I got my weed card. \n\nI’m stupid, though, and will have one with a friend every now and then. Probably smoke a pack a year now. Which is still too much.',
  'role': 'user'},
 {'content': 'Isn’t weed smoking really bad too?', 'role': 'assistant'},
 {'content': 'Yeah, its still tar. But most people are not smoking 20gs a day like a pack a day cig smoker is doing with tobacco, if each cig is about a gram',
  'role': 'user'}]

In [7]:
dataset[5]["text"]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nYou are a redditor, having a conversation with another redditor.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nRates of cancer caused by smoking hit record highs<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nIs this the Gen X/Millenial cancer bump? Consequences arriving with age?\n\nI just hope I quit in time.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nI started at 15. For a girl, of course. Quit after college. Started again on my honeymoon. Quit after my kid. Started again after divorce. Quit again when I got my weed card. \n\nI’m stupid, though, and will have one with a friend every now and then. Probably smoke a pack a year now. Which is still too much.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nIsn’t weed smoking really bad too?<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nYeah, its still tar. But most
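
To make the layout of the string above easier to read, here is a simplified pure-Python sketch of what the template produces for plain system/user/assistant turns. `render_llama31` is a hypothetical helper written for illustration; the real template is a Jinja template inside the tokenizer and handles more cases (tools, missing system prompt, generation prompt). The date header is hard-coded to the values seen in the output above.

```python
# Hard-coded to match the header seen in this notebook's output.
DATE_HEADER = "Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n"


def render_llama31(messages):
    """Simplified stand-in for the Llama 3.1 chat template (illustration only)."""
    out = "<|begin_of_text|>"
    for msg in messages:
        content = msg["content"]
        if msg["role"] == "system":
            # The template prepends the date header to the system prompt.
            content = DATE_HEADER + content
        out += f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n{content}<|eot_id|>"
    return out


convo = [
    {"role": "system", "content": "You are a redditor, having a conversation with another redditor."},
    {"role": "user", "content": "Rates of cancer caused by smoking hit record highs"},
]
print(repr(render_llama31(convo)))
```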

In [8]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)


max_steps is given, it will override any value given in num_train_epochs


In [9]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)


In [10]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nYou are a redditor, having a conversation with another redditor.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nRates of cancer caused by smoking hit record highs<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nIs this the Gen X/Millenial cancer bump? Consequences arriving with age?\n\nI just hope I quit in time.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nI started at 15. For a girl, of course. Quit after college. Started again on my honeymoon. Quit after my kid. Started again after divorce. Quit again when I got my weed card. \n\nI’m stupid, though, and will have one with a friend every now and then. Probably smoke a pack a year now. Which is still too much.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nIsn’t weed smoking really bad too?<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nYeah, its still tar. But most

In [11]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                                         \n\nIs this the Gen X/Millenial cancer bump? Consequences arriving with age?\n\nI just hope I quit in time.<|eot_id|>                                                                                   \n\nIsn’t weed smoking really bad too?<|eot_id|>                                         '
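
The blank stretches above are label positions set to `-100`, which the loss ignores, so only assistant turns contribute to training. A minimal sketch of the idea follows, assuming each header is a single marker token and using strings instead of token ids for readability. `mask_non_responses` and the marker names are hypothetical; the real `train_on_responses_only` matches multi-token header sequences.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def mask_non_responses(tokens, instruction_marker, response_marker):
    """Keep labels only for tokens that follow a response marker."""
    labels = [IGNORE_INDEX] * len(tokens)
    in_response = False
    for i, tok in enumerate(tokens):
        if tok == response_marker:
            in_response = True    # the marker itself stays masked
        elif tok == instruction_marker:
            in_response = False   # a new user turn ends the response span
        elif in_response:
            labels[i] = tok
    return labels


tokens = ["<system>", "You", "are", "<user>", "Hi", "<eot>",
          "<assistant>", "Hello", "<eot>", "<user>", "Bye"]
labels = mask_non_responses(tokens, "<user>", "<assistant>")
print(labels)
```

As in the notebook output, the trailing user turn is fully masked, because there is no assistant reply after it to learn from.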

In [12]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 643,596 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
1,3.1699
2,3.335
3,4.1408
4,3.8361
5,2.9449
6,2.6588
7,2.4977
8,2.4091
9,2.7188
10,2.5161
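
It is worth putting `max_steps = 60` in perspective. Using only the numbers from the training banner above, this short arithmetic sketch shows how little of the dataset such a run touches; the steps-per-epoch figure is approximate, since the trainer's exact rounding is not reproduced here.

```python
num_examples = 643_596   # "Num examples" from the training banner
per_device_batch = 2     # per_device_train_batch_size
grad_accum = 4           # gradient_accumulation_steps
num_gpus = 1
max_steps = 60

effective_batch = per_device_batch * grad_accum * num_gpus
examples_seen = effective_batch * max_steps
fraction_of_epoch = examples_seen / num_examples
approx_steps_per_epoch = num_examples // effective_batch  # approximate

print(f"effective batch size : {effective_batch}")
print(f"examples seen        : {examples_seen}")
print(f"fraction of one epoch: {fraction_of_epoch:.4%}")
print(f"steps for ~1 epoch   : {approx_steps_per_epoch:,}")
```

With these settings the run sees 480 conversations, well under 0.1% of the data, so it is a smoke test rather than a full fine-tune. Setting `num_train_epochs = 1` instead of `max_steps` would take on the order of 80,000 optimizer steps.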


In [13]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Did you see what Hasan said about the election?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


He said something that made me think of him that is similar to something that made me think of Donald Trump.

There is one person that has the kind of energy to talk about a new revolution in the Republican Party. I do not agree with it, but there is one person, that does it better than anyone.

It is interesting how some of the media people who covered Donald Trump have no problem covering this candidate but at the same time seem to not care that the Democratic Party candidate is taking a path that has been shown to not work.

There are 13 Republicans running that are viable, which is what has been happening with a lot of
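
The rambling output is partly a consequence of the sampling settings. `temperature = 1.5` flattens the distribution, and `min_p = 0.1` then discards every token whose probability is below 10% of the most likely token's probability. The pure-Python sketch below illustrates that filtering on made-up logits. `min_p_filter` is a hypothetical helper, and it assumes temperature is applied before the min-p cut.

```python
import math


def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Temperature-scale logits, drop tokens below min_p * max prob, renormalise."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]


# Made-up logits for a four-token vocabulary.
probs = min_p_filter([5.0, 4.0, 1.0, -2.0])
print([round(p, 3) for p in probs])
```

Only the two strongest candidates survive here; the remaining tokens get probability zero, which is what keeps a high temperature from producing complete nonsense.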


In [14]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

In [None]:
# Merge to 16bit
if True: model.save_pretrained_merged("llama3.1-8b-instruct-political-subreddits", tokenizer, save_method = "merged_16bit",)
if True: model.push_to_hub_merged("brianmatzelle/llama3.1-8b-instruct-political-subreddits", tokenizer, save_method = "merged_16bit", token = "hf_...") # pass your own write token; never commit a real one

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which will take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 5.7G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 3.39 out of 12.67 RAM for saving.


 34%|███▍      | 11/32 [00:00<00:01, 14.25it/s]We will save to Disk and not RAM now.
100%|██████████| 32/32 [02:00<00:00,  3.76s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving llama3.1-8b-political-subreddits/pytorch_model-00001-of-00004.bin...
Unsloth: Saving llama3.1-8b-political-subreddits/pytorch_model-00002-of-00004.bin...
