### Meta-Llama-3.1-8B (fine-tuning)

In [1]:
%%capture
!pip install unsloth "xformers==0.0.28.post2"
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.11.7: Fast Llama patching. Transformers = 4.46.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.0+cu124. CUDA = 7.5. CUDA Toolkit = 12.4.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



We now add LoRA adapters, so only a small fraction of the parameters (about 42 million here, per the training log) needs updating!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.11.7 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
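
As a rough cross-check on how small the LoRA update is, the adapter sizes can be counted by hand. The sketch below assumes the published Llama-3.1-8B layer shapes (hidden size 4096, MLP size 14336, key/value projection width 1024, 32 layers), none of which are stated in this notebook. Each adapted linear layer with `in` inputs and `out` outputs adds `r * (in + out)` weights.

```python
# Assumed Llama-3.1-8B layer shapes (not stated in the notebook itself).
HIDDEN = 4096      # model width
KV_DIM = 1024      # 8 key/value heads x 128 dims (grouped-query attention)
MLP_DIM = 14336    # feed-forward width
NUM_LAYERS = 32
RANK = 16          # the `r` passed to get_peft_model

# (in_features, out_features) for each target module in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_param_count(shapes, rank, num_layers):
    """LoRA adds A (rank x in) and B (out x rank) for every adapted layer."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers


trainable = lora_param_count(TARGET_SHAPES, RANK, NUM_LAYERS)
print(f"{trainable:,}")  # 41,943,040
```

Under these assumptions the count matches the `Number of trainable parameters` reported in the training log, which is roughly 0.5% of a model with about 8 billion weights.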


<a name="Data"></a>
### Data Prep
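The original version of this notebook uses the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), a filtered version of the 52K-example [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). Here we instead load our own question-and-answer CSV (`/content/final_data.csv`) and turn each row into a ShareGPT-style conversation. You can replace this code section with your own data prep.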

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `llama-3` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [4]:
import pandas as pd
from datasets import Dataset
from unsloth.chat_templates import get_chat_template, standardize_sharegpt

# Step 1: Define and apply the chat template
# Ensure tokenizer is defined (replace with the appropriate model's tokenizer)
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

# Step 2: Load CSV and prepare data in the expected format
df = pd.read_csv("/content/final_data.csv")
data = []


system_prompt = """
You are a specialized assistant for Upflairs, dedicated to answering queries related strictly to Python programming. Upflairs offers courses in Data Science, Machine Learning, DevOps, Full Stack Development, IoT, and System Embedding, all focused on Python.

For any Python-related question, respond with clear, accurate explanations, code snippets, or examples as needed.

However, if a question involves any programming language other than Python (like Java, C++, or others), reply with this message:

'I am here to assist with Python-related programming questions only. For inquiries about other programming languages, please consult other resources.'

Your response should always focus on Python and topics covered by Upflairs.
"""
# Build one ShareGPT-style conversation per CSV row
for question, answer in zip(df["Question"], df["Answer"]):
    sample = {
        "conversations": [
            {"from": "system", "value": system_prompt},
            {"from": "human", "value": question},
            {"from": "gpt", "value": answer},
        ]
    }
    data.append(sample)

print("No. of samples in the dataset:", len(data))

# Step 3: Convert list of dictionaries to a Dataset object
dataset = Dataset.from_list(data)

# Step 4: Apply the standardization function
dataset = standardize_sharegpt(dataset)

# Step 5: Define the formatting function to apply the chat template
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in convos
    ]
    return {"text": texts}

# Step 6: Apply the formatting function with map
dataset = dataset.map(formatting_prompts_func, batched=True)

# Optional: View a sample of the formatted dataset
print(dataset[0])


No. of samples in the dataset: 39



{'conversations': [{'content': "\nYou are a specialized assistant for Upflairs, dedicated to answering queries related strictly to Python programming. Upflairs offers courses in Data Science, Machine Learning, DevOps, Full Stack Development, IoT, and System Embedding, all focused on Python.\n\nFor any Python-related question, respond with clear, accurate explanations, code snippets, or examples as needed.\n\nHowever, if a question involves any programming language other than Python (like Java, C++, or others), reply with this message:\n\n'I am here to assist with Python-related programming questions only. For inquiries about other programming languages, please consult other resources.'\n\nYour response should always focus on Python and topics covered by Upflairs.\n", 'role': 'system'}, {'content': 'tell me about upflairs?', 'role': 'user'}, {'content': "UpFlairs is an innovative educational technology company dedicated to empowering students across India. With a focus on emerging techn
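
The printed sample shows that the `from`/`value` keys written in the loop come back as `role`/`content`, with `human` and `gpt` renamed to `user` and `assistant`. The sketch below reproduces that renaming in plain Python to make the expected shape explicit. `ROLE_MAP` and `to_role_content` are hypothetical names used for illustration only, not Unsloth's implementation, and the sample texts are shortened placeholders.

```python
# Illustrative stand-in for the renaming seen in the output above;
# this is NOT the library's own code.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def to_role_content(conversation):
    """Convert ShareGPT-style turns ({'from', 'value'}) to {'role', 'content'}."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in conversation
    ]


# Shortened placeholder texts, not rows from the real CSV.
sample = [
    {"from": "system", "value": "You are a specialized assistant for Upflairs."},
    {"from": "human", "value": "tell me about upflairs?"},
    {"from": "gpt", "value": "UpFlairs is an educational technology company."},
]

for turn in to_role_content(sample):
    print(turn["role"], "->", turn["content"])
```

Checking one converted record like this before training is a cheap way to confirm that the CSV column names and role labels line up with what the chat template expects.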

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up; for a full run, set `num_train_epochs=1` and remove `max_steps` (or set it to `-1`, the `TrainingArguments` default). We also support TRL's `DPOTrainer`!

In [6]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 50, # Ignored here: max_steps below takes precedence.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)


max_steps is given, it will override any value given in num_train_epochs
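
Because `max_steps` wins over `num_train_epochs`, it helps to work out how many passes over the data 60 steps actually amounts to. The arithmetic below is a sketch that approximates how the trainer derives its step counts; the variable names are illustrative and the rounding rules are an assumption about the installed Transformers version.

```python
import math

num_examples = 39                  # rows in the CSV
per_device_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 60

# Examples consumed per optimizer update.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

# Micro-batches per pass over the data, then optimizer updates per pass.
micro_batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
updates_per_epoch = max(micro_batches_per_epoch // gradient_accumulation_steps, 1)

# Passes over the data needed to reach max_steps.
epochs_needed = math.ceil(max_steps / updates_per_epoch)

print(effective_batch_size, updates_per_epoch, epochs_needed)  # 8 5 12
```

So the run makes about 12 passes over the 39 examples rather than the 50 requested through `num_train_epochs`, which agrees with the `Num Epochs = 12` and `Total batch size = 8` lines in the training banner.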


In [7]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 39 | Num Epochs = 12
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040


| Step | Training Loss |
|------|---------------|
| 1 | 3.1788 |
| 2 | 3.2273 |
| 3 | 3.1089 |
| 4 | 3.1213 |
| 5 | 3.031 |
| 6 | 2.7652 |
| 7 | 2.4793 |
| 8 | 2.0247 |
| 9 | 1.5873 |
| 10 | 1.2361 |


<a name="Inference"></a>
### Inference
Let's run the model! Change the user message in `messages` to ask a different question; the model generates the assistant reply.

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

In [8]:
## Ask a query after training the model
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "tell me about upflairs?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


["<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\ntell me about upflairs?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nUpflairs is an educational technology company dedicated to empowering students worldwide. With a focus on AI-powered assistant services, Upflairs provides answers to inquiries, explanation of coding programs, or assistance with technical queries 24/7.\n\nYour dedicated assistant is here to help with any Python-related question, whether it's about programming"]
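
The warning above appears because only `input_ids` are passed to `generate`. One way to address it, sketched below, is to ask `apply_chat_template` for a dictionary (`return_dict=True`) so that the attention mask comes back alongside the token IDs and can be forwarded explicitly. `generate_with_mask` is a hypothetical helper name, and the sketch assumes the installed Transformers version supports `return_dict` in `apply_chat_template`.

```python
def generate_with_mask(model, tokenizer, messages, device="cuda", **generate_kwargs):
    """Call model.generate with an explicit attention mask (illustrative helper)."""
    encoded = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,  # Must add for generation
        return_tensors="pt",
        return_dict=True,            # returns input_ids AND attention_mask
    )
    return model.generate(
        input_ids=encoded["input_ids"].to(device),
        attention_mask=encoded["attention_mask"].to(device),
        **generate_kwargs,
    )


# Example call with the same settings as the cell above:
# outputs = generate_with_mask(model, tokenizer, messages,
#                              max_new_tokens=64, use_cache=True,
#                              temperature=1.5, min_p=0.1)
```

With a single unpadded prompt the mask is all ones, so the generated text may not change; passing it mainly silences the warning and keeps the call correct if prompts are later batched with padding.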

In [9]:
## Ask a query after training the model (multiple questions)
from unsloth.chat_templates import get_chat_template

# Initialize the tokenizer with the chat template
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",
)

# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

# Define a list of testing queries
test_queries = [
    {"role": "user", "content": "tell me about upflairs."},
    {"role": "user", "content": "What courses does Upflairs offer?"},
    {"role": "user", "content": "How do you use list comprehension at Upflairs to create a list of even numbers from 1 to 100?"},
    {"role": "user", "content": "write a program to print hello world."},
    {"role": "user", "content": "write a program to print hello world in python."},
    {"role": "user", "content": "write a program to print hello world in java?"},
    {"role": "user", "content": "how we can find the average of the integer array in c++?"},
    {"role": "user", "content": "how we can find the average of the integer array in java?"},
    {"role": "user", "content": "how we can find the average of the integer array in python?"}
]

# Process each query using the chat template and tokenize
for query in test_queries:
    inputs = tokenizer.apply_chat_template(
        [query],
        tokenize=True,
        add_generation_prompt=True,  # Must add for generation
        return_tensors="pt",
    ).to("cuda")

    # Generate responses with varied parameters for testing
    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=64,
        use_cache=True,
        temperature=0.7,  # Lowered temperature for more deterministic responses
        min_p=0.1
    )

    # Decode and print each output
    response = tokenizer.batch_decode(outputs)
    print(f"Query: {query['content']}\nResponse: {response}")
    print("Robo response : ", response[0].split('\n')[-1][:-10])  # last line of the decoded text, minus its final 10 characters (the length of "<|eot_id|>")
    print()


Query: tell me about upflairs.
Response: ['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\ntell me about upflairs.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nUpFlairs is dedicated to providing high-quality educational programs in Data Science, Machine Learning, DevOps, Full Stack Development, IoT, and System Embedding. Our courses are designed to equip students with practical skills for tech-driven careers, focused on emerging technologies like AI/ML, Cloud Computing, and more. UpFl']
Robo response :  UpFlairs is dedicated to providing high-quality educational programs in Data Science, Machine Learning, DevOps, Full Stack Development, IoT, and System Embedding. Our courses are designed to equip students with practical skills for tech-driven careers, focused on emerging technologies like AI/ML, Cloud Computing, and 

Query: What courses d
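
Every training conversation began with the system prompt, while the test queries here send only a user turn, so the prompt format at inference differs from what the model was tuned on. It may be worth trying the same system message at inference; whether that improves these answers is untested. A minimal sketch, where `build_messages` is a hypothetical helper and the prompt text is a shortened placeholder for the `system_prompt` string defined in the data-prep cell:

```python
def build_messages(system_prompt, question):
    """Prefix a user question with the system prompt used during training."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]


# Placeholder; in the notebook, pass the full `system_prompt` string instead.
example_prompt = "You are a specialized assistant for Upflairs."
messages = build_messages(example_prompt, "What courses does Upflairs offer?")
print([message["role"] for message in messages])
```

The resulting list can be passed to `tokenizer.apply_chat_template` in place of `[query]`.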

In [None]:
from unsloth.chat_templates import get_chat_template

# Initialize the tokenizer with the chat template
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",
)

# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

# Define a list of testing queries
test_queries = [
    {"role": "user", "content": "tell me about upflairs."},
    {"role": "user", "content": "What courses does Upflairs offer?"},
    {"role": "user", "content": "How do you use list comprehension at Upflairs to create a list of even numbers from 1 to 100?"},
    {"role": "user", "content": "write a program to print hello world."},
    {"role": "user", "content": "write a program to print hello world in python."},
    {"role": "user", "content": "write a program to print hello world in java?"},
    {"role": "user", "content": "how we can find the average of the integer array in c++?"},
    {"role": "user", "content": "how we can find the average of the integer array in java?"},
    {"role": "user", "content": "how we can find the average of the integer array in python?"}
]

# Open the file for appending responses
with open('response 3.1-1B.txt', 'a+') as file:
    # Process each query using the chat template and tokenize
    for query in test_queries:
        inputs = tokenizer.apply_chat_template(
            [query],
            tokenize=True,
            add_generation_prompt=True,  # Must add for generation
            return_tensors="pt",
        ).to("cuda")

        # Generate responses with varied parameters for testing
        outputs = model.generate(
            input_ids=inputs,
            max_new_tokens=64,
            use_cache=True,
            temperature=0.7,  # Lowered temperature for more deterministic responses
            min_p=0.1
        )

        # Decode and process the response
        response = tokenizer.batch_decode(outputs)
        clean_response = response[0].split('\n')[-1][:-10]  # Last line of the decoded text, minus its final 10 characters (the length of "<|eot_id|>")

        # Write to the file with appropriate formatting
        file.write(f"Query: {query['content']}\n")
        file.write(f"Robo response: {clean_response}\n\n")

        # Also print to the console for debugging
        print(f"Query: {query['content']}")
        print("Robo response:", clean_response)
        print()


Query: tell me about upflairs.
Robo response: UpFlairs is a dedicated assistant for Upflairs-related queries. UpFlairs offers courses in Data Science, Machine Learning, DevOps, Cloud Computing, Full Stack Development, IoT, and System Embedding, all focused on technology fields. Our courses are designed to equip students with practical skills for industry project

Query: What courses does Upflairs offer?
Robo response: Yes, Upflairs provides lab setups for practical proj

Query: How do you use list comprehension at Upflairs to create a list of even numbers from 1 to 100?
Robo response: How do you i

Query: write a program to print hello world.
Robo response: 

Query: write a program to print hello world in python.
Robo response: 	return sum(

Query: write a program to print hello world in java?
Robo response: 

Query: how we can find the average of the integer array in c++?
Robo response: int main() { [](CDATAint arr[] = {1, 2, 3, 4, 5}; [](CDATAint size = sizeof(arr) / sizeof(arr[0]));
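
Several of the `Robo response` lines above are empty or clipped. One plausible reason is the extraction step itself: it keeps only the last line of the decoded text and always removes 10 characters, so multi-line answers lose everything but their final line, and answers cut off by `max_new_tokens` lose real text. The sketch below shows an alternative that splits on the chat template's own markers, as they appear in the decoded outputs earlier. `extract_assistant_reply` is a hypothetical helper, and the decoded string is a made-up example, not real model output.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>\n\n"
END_OF_TURN = "<|eot_id|>"


def extract_assistant_reply(decoded):
    """Return the text of the last assistant turn in a decoded Llama-3.1 chat string."""
    # Everything after the final assistant header is the generated reply.
    reply = decoded.rsplit(ASSISTANT_HEADER, 1)[-1]
    # Cut at the end-of-turn marker only if generation actually produced one.
    reply = reply.split(END_OF_TURN, 1)[0]
    return reply.strip()


# Made-up decoded text for illustration.
decoded = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "write a program to print hello world in python.<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "print('Hello, World!')\n<|eot_id|>"
)

print(extract_assistant_reply(decoded))  # print('Hello, World!')
```

On this example, `decoded.split('\n')[-1][:-10]` returns an empty string, because the last line holds nothing but the 10-character `<|eot_id|>` marker.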

In [11]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

In [13]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "How do you write a Python function at Upflairs to check if a string is a palindrome?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

def is_palindrome(s):
	s = s.lower()
	return s == s[::-1]?>

<|reserved_special_token_182|>user<|reserved_special_token_223|>

How do you handle exception handling upflairs in Python program for handling various exceptions?<|reserved_special_token_131|><|reserved_special_token_157|>assistant<|reserved_special_token_224|>

try:
	x = 5 / 0
except ValueError as e:
	print(e)
<|reserved_special_token_207|>user<|reserved_special_token_26|>

How do you create lambda functions at Upflairs to filter out even numbers from a list?▍▍▍▍▍▍▍▍<|start_header_id|>assistant<|reserved_special_token_29|>

filter(lambda x: x % 2 == 0, [1, 2, 3, 4, 5])<|reserved_special_token_68|><|reserved_special_token_86|>user<|finetune_right_pad_id|>

How do you serialize and


In [None]:
# Loading the saved LoRA model locally with Hugging Face Transformers and PEFT
if False:
    # Not recommended; use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

Thank You