To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth#installation-instructions---conda).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save) (e.g. for llama.cpp).

[NEW] Llama-3.1 8b, 70b & 405b are trained on a crazy 15 trillion tokens with 128K long context lengths!

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

In [None]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
* [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
* [**NEW**] We make Mistral NeMo 12B 2x faster and fit in under 12GB of VRAM! [Mistral NeMo notebook](https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing)

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install wandb -qU

In [None]:
# Log in to your W&B account
import wandb

# Use wandb-core, temporary for wandb's new backend
wandb.require("core")

In [None]:
wandb.login()

wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc

True

In [None]:
wandb.init(project="TextExpansions_NLP", name="LLAMA 8B run_5")

wandb: Currently logged in as: tvallabh (tvallabh-university-of-chicago). Use `wandb login --relogin` to force relogin


# Data Preprocessing

In [None]:
import pandas as pd
import numpy as np
import torch
csv_expansions = '/content/drive/My Drive/NLP Final Project Data/df_combined_expansions.csv'
df_expansions = pd.read_csv(csv_expansions)

In [None]:
df_expansions

Unnamed: 0,notes,expanded_content
0,Algorithms are step-by-step procedures or form...,# Algorithms: The Foundation of Computer Progr...
1,Data structures are methods for organizing and...,# Data Structures: Organizing and Storing Data...
2,"Sorting algorithms, such as quicksort and merg...",# Sorting Algorithms: An Overview\n\nSorting a...
3,Searching algorithms enable the identification...,# Searching Algorithms: Linear and Binary Sear...
4,Big O notation is a mathematical representatio...,# Big O Notation\n\nBig O notation is a mathem...
...,...,...
1495,Suffix trees can be constructed for multiple s...,# Suffix Trees for Multiple Strings\n\nSuffix ...
1496,One key operation on suffix trees is the const...,# Suffix Trees and Suffix Links\n\nSuffix tree...
1497,Suffix trees play a crucial role in the field ...,# Suffix Trees and Their Role in Data Compress...
1498,Visualization of suffix trees helps in underst...,# Visualization of Suffix Trees: Understanding...


In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",          # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.0.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



In [None]:
def count_tokens(text, tokenizer):
    tokens = tokenizer(text, return_tensors="pt")["input_ids"].shape[1]
    return tokens

# Count tokens in 'notes' and 'expanded_content' columns
notes_token_counts = df_expansions['notes'].apply(lambda x: count_tokens(x, tokenizer))
expansions_token_counts = df_expansions['expanded_content'].apply(lambda x: count_tokens(x, tokenizer))

# Print the token counts
print("Notes Token Counts:", notes_token_counts)
print("Expansions Token Counts:", expansions_token_counts)

Notes Token Counts: 0       27
1       31
2       33
3       27
4       29
        ..
1495    23
1496    26
1497    28
1498    22
1499    27
Name: notes, Length: 1500, dtype: int64
Expansions Token Counts: 0       733
1       839
2       860
3       755
4       822
       ... 
1495    751
1496    743
1497    764
1498    687
1499    724
Name: expanded_content, Length: 1500, dtype: int64


In [None]:
# Count tokens in 'notes' and 'expanded_content' columns
df_expansions['notes_token_count'] = df_expansions['notes'].apply(lambda x: count_tokens(x, tokenizer))
df_expansions['expansions_token_count'] = df_expansions['expanded_content'].apply(lambda x: count_tokens(x, tokenizer))


In [None]:
# Descriptive statistics for 'notes_token_count'
notes_stats = df_expansions['notes_token_count'].describe()
print("Statistics for 'notes_token_count':")
print(notes_stats)

# Descriptive statistics for 'expansions_token_count'
expansions_stats = df_expansions['expansions_token_count'].describe()
print("\nStatistics for 'expansions_token_count':")
print(expansions_stats)

Statistics for 'notes_token_count':
count    1500.000000
mean       29.266667
std         4.025049
min        19.000000
25%        26.000000
50%        29.000000
75%        32.000000
max        52.000000
Name: notes_token_count, dtype: float64

Statistics for 'expansions_token_count':
count    1500.000000
mean      759.546667
std        52.927567
min       600.000000
25%       723.000000
50%       757.000000
75%       793.000000
max      1006.000000
Name: expansions_token_count, dtype: float64


In [None]:
# Filter out rows where 'notes' or 'expanded_content' have more than 1000 tokens as they are cut off
filtered_df = df_expansions[(df_expansions['notes_token_count'] <= 1000) & (df_expansions['expansions_token_count'] <= 1000)]

# Print the shape of the filtered dataframe to see how many rows were removed
print(f"Original dataframe shape: {df_expansions.shape}")
print(f"Filtered dataframe shape: {filtered_df.shape}")


Original dataframe shape: (1500, 4)
Filtered dataframe shape: (1499, 4)


In [None]:
instructions = """
    You are a computer science expert and a skilled writer.

    Craft detailed content about the given computer science subtopic for university-level lecture notes, targeting a total of about 500 words distributed over a few paragraphs.

    Begin with an introductory paragraph that lays the foundation of the subtopic. Follow this with detailed paragraphs focusing on the critical aspects of the subtopic. Include applications only if they are essential for understanding the concept; otherwise, concentrate on explaining the concept itself and its nuances.

    You can selectively, if necessary, use examples, tables in Markdown format to illustrate key points, ensuring that any code provided is concise and directly demonstrates the concept, otherwise you don't need to include it.

    Please also avoid overly detailed explanations of complex algorithms unless they are central to the subtopic. Do not go overboard with technical details that may overwhelm students.

    Let's try to avoid generating code unless its short and obvious, otherwise, focus on detailed explanations and if you use equations, please use inline HTML. Quick and simple inline equations can utilize HTML ampersand entity codes, such as:

        h<sub>&theta;</sub>(x) = &theta;<sub>o</sub> x + &theta;<sub>1</sub>x

    This method works in practically all Markdown and does not require any external libraries. Avoid using LaTeX. If you cannot express it in HTML, please avoid using equations. Unless the symbol is simple and can be represented in HTML and Markdown, avoid using those symbols.

    Let's try to avoid generating code unless its short and obvious, otherwise, focus on detailed explanations and if you use equations, please use LaTeX format.

    Maintain clear and concise language suitable for a 10th-grade reading level, using academic language where appropriate. Avoid overly technical jargon unless it is necessary for clarity.

    Also avoid your conclusion paragraph in the end since the content should be detailed throughout.

    The entire response must be in valid Markdown format and avoid the use of diagrams unless they can be effectively represented in Markdown. You must stay in our limit of 500 words.

    LaTeX is impossible to use in Markdown, so please use HTML for equations. Do not use LaTeX.

    Your input will always be a single computer science subtopic, and your output should not conclude with a summarizing paragraph but rather emphasize detailed explanation throughout.


    Now, please generate detailed content about the subtopic in Markdown:
    """

In [None]:
# Get the total token count for 'notes' and 'expanded_content'
total_notes_tokens = df_expansions['notes_token_count'].sum()
total_expansions_tokens = df_expansions['expansions_token_count'].sum()
total_instr_tokens = count_tokens(instructions, tokenizer)

# Print the total token counts
print(f"Total tokens in 'notes': {total_notes_tokens}")
print(f"Total tokens in 'expanded_content': {total_expansions_tokens}")
print(f"Total tokens in 'instructions': {total_instr_tokens}")

Total tokens in 'notes': 43900
Total tokens in 'expanded_content': 1139320
Total tokens in 'instructions': 491
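
As a rough check that a full training example fits inside `max_seq_length`, the token counts printed above can be added up. This is a minimal sketch using the reported numbers (491 instruction tokens, a 52-token maximum for notes, and the 1000-token filter threshold for expansions); the `template_overhead` allowance for the marker lines in the combined prompt is an assumed figure, not a measured one.

```python
# Worst-case length of one training example, using the counts reported above.
max_seq_length = 2048        # context length passed to from_pretrained
instr_tokens = 491           # tokens in `instructions`
max_notes_tokens = 52        # longest 'notes' entry
max_expansion_tokens = 1000  # filter threshold for 'expanded_content'
template_overhead = 32       # ASSUMPTION: allowance for the marker lines

worst_case = instr_tokens + max_notes_tokens + max_expansion_tokens + template_overhead
headroom = max_seq_length - worst_case

print(f"Worst-case example length: {worst_case} tokens")
print(f"Headroom below max_seq_length: {headroom} tokens")
```

Each count came from a separate tokenizer call, so if the tokenizer prepends a special token on every call, the sum leans slightly high, which is the safe direction for this check.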


In [None]:
from datasets import Dataset

# training_df = pd.DataFrame()
# Create combined prompts with a new line after instructions
combined_prompts = filtered_df.apply(
    lambda row: f"{instructions}\nEXAMPLE:\n\n###INPUT (Notes): \n {row['notes']}\n\n###OUTPUT (Expected Generations):\n {row['expanded_content']}", axis=1
)


In [None]:
# Split the training data:
from sklearn.model_selection import train_test_split

# Split the combined prompts into training, validation, and test sets
train_prompts, temp_prompts = train_test_split(combined_prompts, test_size=0.2, random_state=42)
val_prompts, test_prompts = train_test_split(temp_prompts, test_size=0.5, random_state=42)

# Create a dataset dictionary for training, validation, and testing
train_dataset_dict = {'text': train_prompts}
val_dataset_dict = {'text': val_prompts}
test_dataset_dict = {'text': test_prompts}

# Convert the dictionary to Dataset objects
train_dataset = Dataset.from_dict(train_dataset_dict)
val_dataset = Dataset.from_dict(val_dataset_dict)
test_dataset = Dataset.from_dict(test_dataset_dict)

# Print a sample combined prompt to verify
print(train_dataset['text'][0])
print(val_dataset['text'][0])
print(test_dataset['text'][0])


    You are a computer science expert and a skilled writer.

    Craft detailed content about the given computer science subtopic for university-level lecture notes, targeting a total of about 500 words distributed over a few paragraphs.

    Begin with an introductory paragraph that lays the foundation of the subtopic. Follow this with detailed paragraphs focusing on the critical aspects of the subtopic. Include applications only if they are essential for understanding the concept; otherwise, concentrate on explaining the concept itself and its nuances.

    You can selectively, if necessary, use examples, tables in Markdown format to illustrate key points, ensuring that any code provided is concise and directly demonstrates the concept, otherwise you don't need to include it.

    Please also avoid overly detailed explanations of complex algorithms unless they are central to the subtopic. Do not go overboard with technical details that may overwhelm students.

    Let's try to avoid
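
The two-stage split above produces roughly an 80/10/10 partition. Here is a minimal sketch of the size arithmetic, assuming scikit-learn's rounding for a float `test_size` (the test share is rounded up and the remainder goes to the training side). `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a float test_size, rounding the test share up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 1499  # rows left after the token-length filter
n_train, n_temp = split_sizes(n_total, 0.2)  # first split: train vs. held-out
n_val, n_test = split_sizes(n_temp, 0.5)     # second split: validation vs. test

print(n_train, n_val, n_test)
```

Under that assumption the sizes come out to 1,199 / 150 / 150, which agrees with the example counts the trainer reports when it maps the training and validation datasets.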

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
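
To see where the adapter size comes from: a LoRA adapter of rank `r` on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` trainable weights. The sketch below applies that to the seven target modules, assuming the published Llama-3.1-8B shapes (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers). These dimensions are not printed in this notebook, so treat them as assumptions.

```python
# LoRA adds two matrices per adapted layer: A (r x d_in) and B (d_out x r).
r, lora_alpha, n_layers = 64, 16, 32

# ASSUMED Llama-3.1-8B layer shapes as (d_in, d_out).
hidden, mlp, kv_dim = 4096, 14336, 8 * 128
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, mlp),
    "up_proj": (hidden, mlp),
    "down_proj": (mlp, hidden),
}

per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
trainable = per_layer * n_layers

print(f"Trainable LoRA parameters: {trainable:,}")
print(f"LoRA scaling (lora_alpha / r): {lora_alpha / r}")
```

Under these assumptions the total is 167,772,160, the same figure the trainer later reports as the number of trainable parameters. With `lora_alpha = 16` and `r = 64`, the adapter update is scaled by 0.25.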


In [None]:
!pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24935 sha256=51eb57325365fb538748e6c9ca4579b9479f1d979c2472211381cc4d4adff981
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). Here we train for 5 full epochs (`num_train_epochs = 5`), evaluating and saving a checkpoint after every epoch; for a quick test run you can set `max_steps = 60` instead. We also support TRL's `DPOTrainer`!

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    eval_dataset = val_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    # compute_metrics=compute_metrics,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 30,
        num_train_epochs = 5, # 5 full passes over the training set.
        # max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        # logging_steps = 0.1,
        logging_steps = 20,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "cosine",
        seed = 3407,
        output_dir = "outputs",
        evaluation_strategy = "epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        report_to="wandb",
        run_name="LLAMA-8B_run_5",
    ),
)



Map (num_proc=2):   0%|          | 0/1199 [00:00<?, ? examples/s]

Map (num_proc=2):   0%|          | 0/150 [00:00<?, ? examples/s]

In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.564 GB.
6.457 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 1,199 | Num Epochs = 5
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 750
 "-____-"     Number of trainable parameters = 167,772,160


Epoch,Training Loss,Validation Loss
1,0.6948,0.715897
2,0.6509,0.704014
3,0.5842,0.718662
4,0.4893,0.756261
5,0.4396,0.792841
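
The step count in the banner and the best checkpoint can both be recovered from the numbers above. This is a minimal sketch, assuming the trainer performs one optimizer step per `gradient_accumulation_steps` mini-batches and floors the per-epoch count; the loss values are copied from the table.

```python
import math

num_examples, per_device_batch, grad_accum, epochs = 1199, 2, 4, 5

batches_per_epoch = math.ceil(num_examples / per_device_batch)  # last batch may be partial
steps_per_epoch = max(batches_per_epoch // grad_accum, 1)       # optimizer updates per epoch
total_steps = steps_per_epoch * epochs

# (epoch, training loss, validation loss) as logged during the run
history = [
    (1, 0.6948, 0.715897),
    (2, 0.6509, 0.704014),
    (3, 0.5842, 0.718662),
    (4, 0.4893, 0.756261),
    (5, 0.4396, 0.792841),
]
best_epoch, _, best_val = min(history, key=lambda row: row[2])

print(f"Total optimizer steps: {total_steps}")
print(f"Lowest validation loss: {best_val} at epoch {best_epoch}")
```

Validation loss is lowest after epoch 2 and rises afterwards while training loss keeps falling, which is the usual sign of overfitting. Because `load_best_model_at_end=True` and `metric_for_best_model="eval_loss"` are set, the trainer should restore the checkpoint with the lowest validation loss once training finishes.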


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

2954.8328 seconds used for training.
49.25 minutes used for training.
Peak reserved memory = 19.771 GB.
Peak reserved memory for training = 13.314 GB.
Peak reserved memory % of max memory = 49.972 %.
Peak reserved memory for training % of max memory = 33.652 %.


In [None]:
best_model = trainer.model

In [None]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

In [None]:
!cp -r /content/lora_model '/content/drive/My Drive/NLP Final Project Data'

In [None]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=max_seq_length)

tokenized_test_dataset = test_dataset.map(preprocess_function, batched=True, remove_columns=test_dataset.column_names)

Map:   0%|          | 0/150 [00:00<?, ? examples/s]

In [None]:
# After training, evaluate on the test set
test_results = trainer.evaluate(tokenized_test_dataset)
print(test_results)

# Log the test results to wandb
wandb.log({"test_results": test_results})
# Close wandb run
wandb.finish()

{'eval_loss': 0.6908238530158997, 'eval_runtime': 27.8086, 'eval_samples_per_second': 5.394, 'eval_steps_per_second': 0.683, 'epoch': 5.0}


Run history:
eval/loss,▃▂▃▅█▁
eval/runtime,▁▁▁▁▁█
eval/samples_per_second,█████▁
eval/steps_per_second,█████▁
train/epoch,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇█████
train/global_step,▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇██████
train/grad_norm,█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂
train/learning_rate,▆██████▇▇▇▇▇▆▆▆▆▅▅▅▄▄▄▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁
train/loss,█▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁

Run summary:
eval/loss,0.69082
eval/runtime,27.8086
eval/samples_per_second,5.394
eval/steps_per_second,0.683
total_flos,3.651520822013952e+17
train/epoch,5.0
train/global_step,750.0
train/grad_norm,0.22059
train/learning_rate,0.0
train/loss,0.4396


In [None]:
from unsloth import FastLanguageModel
import torch

# Reuse the same parameters from training
max_seq_length = 2048
dtype = None  # None for auto detection
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model",  # The directory where your model was saved
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

# Define the prompt template
prompt_template = """{instructions}

###INPUT (Notes):
{input}

###OUTPUT (Expected Generations):
{output}"""

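# Function to generate text
# Note: the training prompts were built with an extra "EXAMPLE:" line before
# "###INPUT (Notes):", which this template does not include.
def generate_text(input_text, max_new_tokens=1000):
    prompt = prompt_template.format(
        instructions=instructions,  # Use the instructions from your training
        input=input_text,
        output=""  # Leave this blank for generation
    )

    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,  # Honor the function argument
            use_cache=True,
            do_sample=True,  # Needed for temperature / top_p to take effect
            temperature=0.7,  # Adjust as needed
            top_p=0.9,  # Adjust as needed
        )

    # The decoded string contains the prompt followed by the generated text
    return tokenizer.decode(outputs[0], skip_special_tokens=True)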

# Example usage
input_text = "Explain the concept of recursion in programming."
generated_text = generate_text(input_text)
print(generated_text)

==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.0.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!

    You are a computer science expert and a skilled writer.

    Craft detailed content about the given computer science subtopic for university-level lecture notes, targeting a total of about 500 words distributed over a few paragraphs.

    Begin with an introductory paragraph that lays the foundation of the subtopic. Follow this with detailed paragraphs focusing on the critical aspects of the subtopic. Include applications only if they are essential for understanding the concept; otherwise, concentrate on explaining the concept itself

In [None]:
!pip install -q datasets sacrebleu nltk

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
from datasets import load_metric
from tqdm import tqdm
import torch
from sacrebleu.metrics import BLEU

# Load metrics
rouge = load_metric("rouge")
meteor = load_metric("meteor")
bleu = BLEU()

def evaluate_model(model, tokenizer, test_dataset, batch_size=4):
    model.eval()
    predictions = []
    references = []

    for i in tqdm(range(0, len(test_dataset), batch_size)):
        batch = test_dataset[i:i+batch_size]

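        # Inputs and references both come from the 'text' field, which holds the
        # full example (instructions, notes and reference expansion), so the
        # reference text is already part of the prompt passed to generate().
        inputs = batch['text']
        batch_references = batch['text']  # Reference is extracted from 'text' below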

        # Generate predictions
        encoded_inputs = tokenizer(inputs, return_tensors="pt", padding=True, truncation=True).to(model.device)
        with torch.no_grad():
            outputs = model.generate(**encoded_inputs, max_new_tokens=1000)

        batch_predictions = tokenizer.batch_decode(outputs, skip_special_tokens=True)

        # Process predictions and references
        for pred, ref in zip(batch_predictions, batch_references):
            # Extract generated content (after "###OUTPUT (Expected Generations):")
            pred_content = pred.split("###OUTPUT (Expected Generations):")[-1].strip()
            predictions.append(pred_content)

            # Extract reference content (after "###OUTPUT (Expected Generations):")
            ref_content = ref.split("###OUTPUT (Expected Generations):")[-1].strip()
            references.append(ref_content)

    # Calculate ROUGE scores
    rouge_scores = rouge.compute(predictions=predictions, references=references, use_stemmer=True)

    # Calculate METEOR score
    meteor_score = meteor.compute(predictions=predictions, references=references)

    # Calculate BLEU score
    bleu_score = bleu.corpus_score(predictions, [references])

    # Print results
    print(f"ROUGE-1: {rouge_scores['rouge1'].mid.fmeasure:.4f}")
    print(f"ROUGE-2: {rouge_scores['rouge2'].mid.fmeasure:.4f}")
    print(f"ROUGE-L: {rouge_scores['rougeL'].mid.fmeasure:.4f}")
    print(f"METEOR: {meteor_score['meteor']:.4f}")
    print(f"BLEU: {bleu_score.score:.4f}")

    return predictions, references


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [None]:
predictions, references = evaluate_model(model, tokenizer, test_dataset)


100%|██████████| 38/38 [04:58<00:00,  7.86s/it]


ROUGE-1: 0.9750
ROUGE-2: 0.9752
ROUGE-L: 0.9753
METEOR: 0.9481
BLEU: 95.0070
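
Scores this close to 1.0 deserve a second look: `evaluate_model` passes the full `text` field to `generate`, and that field already ends with the reference expansion, so the text extracted after the output marker starts with the reference itself. Below is a minimal sketch of one way to separate the prompt from the reference before generation. `split_example` is a hypothetical helper, not part of the notebook.

```python
OUTPUT_MARKER = "###OUTPUT (Expected Generations):"

def split_example(text, marker=OUTPUT_MARKER):
    """Split a training-style example into (prompt, reference).

    The prompt keeps everything up to and including the marker, so the
    model has to produce the expansion itself; the reference is what followed.
    """
    head, sep, tail = text.partition(marker)
    if not sep:
        raise ValueError("output marker not found in example")
    return head + sep + "\n", tail.strip()

# Toy example in the same layout as the notebook's combined prompts.
example = (
    "instructions...\nEXAMPLE:\n\n###INPUT (Notes): \n Recursion basics\n\n"
    "###OUTPUT (Expected Generations):\n # Recursion\n\nA function that calls itself."
)
prompt, reference = split_example(example)

print(repr(prompt[-40:]))
print(repr(reference))
```

Feeding only `prompt` to `model.generate` and decoding just the newly generated tokens (for example `outputs[:, encoded_inputs["input_ids"].shape[1]:]`) keeps the reference out of the prediction. Batched generation with a decoder-only model generally also works better with left padding (`tokenizer.padding_side = "left"`).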


# Evaluation

<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!


And we're done! If you have any questions about Unsloth, we have a [Discord](https://discord.gg/u54VK8m8tk) channel! If you find a bug, need help, want to join projects, or just want to keep up with the latest LLM news, feel free to join!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
9. [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)
10. [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
11. [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
12. [**NEW**] We make Mistral NeMo 12B 2x faster and fit in under 12GB of VRAM! [Mistral NeMo notebook](https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Support our work if you can! Thanks!
</div>