Change runtime type to "A100". Then, you can "*Run all*" or play through each cell one-by-one, starting with the very first cell.

A BIG thank you to Unsloth for their incredible work on quantization and LoRA. See their original work [here](https://github.com/unslothai/unsloth), join their [Discord](https://discord.gg/u54VK8m8tk), or star their repo on [Github ⭐](https://github.com/unslothai/unsloth).


### Enable cell word wrapping

Enable word wrapping for all the code and output cells in this Google Colab notebook. Long lines of text will wrap to the next line instead of extending horizontally, improving readability.

In [None]:
%%capture
!pip install huggingface_hub
!pip install datasets --upgrade
!pip install --upgrade pyarrow


In [None]:
#@title Install software dependencies
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install trl "peft<0.11.0" accelerate bitsandbytes
!pip install xformers

Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-xlg3756s/unsloth_18e90fa24d754ee1a45bf31ff4852e3c
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-xlg3756s/unsloth_18e90fa24d754ee1a45bf31ff4852e3c
  Resolved https://github.com/unslothai/unsloth.git to commit 62c989ef0ae0e9fbac714a4cb21eda76c1fe84b6
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tyro (from unsloth@ git+https://github.com/unslothai/unsloth.git->unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Downloading tyro-0.8.10-py3-none-any.whl.metadata (8.4 kB)
Collecting sentencepiece>=0.2.0 (from unsloth@ git+https://github.com/unslothai/unsloth.git->unsloth[

In [None]:
from IPython.display import HTML, display
from google.colab import files
import pandas as pd
from datasets import Dataset
from tqdm import tqdm
import time


def set_css():
    display(HTML('''
    <style>
        pre {
            white-space: pre-wrap;
        }
    </style>
    '''))

get_ipython().events.register('pre_run_cell', set_css)

<a name="Data"></a>
### Load and prepare the training data

Next, we use `AutoTokenizer` and `Dataset` from the Hugging Face `transformers` and `datasets` libraries to convert our training data into a sequence of tokens, selecting the same base model we will later use for training.

This demo uses a custom dataset for a fictitious company called Cosmic Fusion Dynamics. You may replace the dataset with your own.

**[NOTE]** Remember to append the **EOS_TOKEN** to the end of each training example, or the fine-tuned model may never stop generating text!

Download the [training dataset](https://github.com/scott4ai/llama3-8b-fine-tuning-cosmic-fusion-dynamics/blob/main/cosmic_fusion_dynamics_data.csv) to use in this notebook, then upload it to the Colab environment.


In [None]:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/llama-3-8b-bnb-4bit")

EOS_TOKEN = tokenizer.eos_token

Training_schemas_pd = pd.read_csv('df_schemas.csv')
Training_QA_pd = pd.read_csv('SQL_df.csv')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

In [None]:
## Joining the df_schema to the train data.
df_train = Training_QA_pd.merge(Training_schemas_pd, on='db_id', how='left')

# Check for any missing schemas
missing_schemas = df_train[df_train['db_schema'].isnull()]
if not missing_schemas.empty:
    print(f"Warning: {len(missing_schemas)} rows have missing schemas.")
    print("Unique db_ids with missing schemas:", missing_schemas['db_id'].unique())

# Verify the merge
print("Shape of df_train after merge:", df_train.shape)
print("Columns in df_train:", df_train.columns)

# Check for null values in the merged dataframe
null_counts = df_train.isnull().sum()
print("Null value counts:\n", null_counts)

Unique db_ids with missing schemas: ['twitter_1' 'chinook_1' 'college_2' 'epinions_1' 'small_bank_1'
 'company_1' 'formula_1' 'inn_1' 'icfp_1' 'college_1' 'student_1' 'wine_1'
 'flight_4']
Shape of df_train after merge: (7000, 7)
Columns in df_train: Index(['db_id', 'query', 'question', 'query_toks', 'query_toks_no_value',
       'question_toks', 'db_schema'],
      dtype='object')
Null value counts:
 db_id                    0
query                    0
question                 0
query_toks               0
query_toks_no_value      0
question_toks            0
db_schema              984
dtype: int64


In [None]:
# Remove all NA values from df_train
df_schema_train = df_train.dropna()
df_schema_train = df_schema_train.groupby('db_id').apply(lambda x: x.head(5)).reset_index(drop=True)
# Print the shape of the new dataframe
print("Shape of df_schema_train after removing NA values:", df_schema_train.shape)

Shape of df_schema_train after removing NA values: (635, 7)


In [None]:
# Add system instruction column
df_schema_train['instruction'] = 'You are an agent designed to interact with a SQL database. \n Given an input question, create a syntactically correct SQL query to provide to the user based on the below schema:\n'

In [None]:
df_schema_train.head()

| | db_id | query | question | query_toks | query_toks_no_value | question_toks | db_schema | instruction |
|---|---|---|---|---|---|---|---|---|
| 0 | activity_1 | `SELECT count(*) FROM Faculty` | How many faculty do we have? | `['SELECT', 'count', '(', '*', ')', 'FROM', 'Fa...` | `['select', 'count', '(', '*', ')', 'from', 'fa...` | `['How', 'many', 'faculty', 'do', 'we', 'have',...` | `create table Activity (\n actid INTEGER PRIMA...` | You are an agent designed to interact with a S... |
| 1 | activity_1 | `SELECT count(*) FROM Faculty` | What is the total number of faculty members? | `['SELECT', 'count', '(', '*', ')', 'FROM', 'Fa...` | `['select', 'count', '(', '*', ')', 'from', 'fa...` | `['What', 'is', 'the', 'total', 'number', 'of',...` | `create table Activity (\n actid INTEGER PRIMA...` | You are an agent designed to interact with a S... |
| 2 | activity_1 | `SELECT DISTINCT rank FROM Faculty` | What ranks do we have for faculty? | `['SELECT', 'DISTINCT', 'rank', 'FROM', 'Faculty']` | `['select', 'distinct', 'rank', 'from', 'faculty']` | `['What', 'ranks', 'do', 'we', 'have', 'for', '...` | `create table Activity (\n actid INTEGER PRIMA...` | You are an agent designed to interact with a S... |
| 3 | activity_1 | `SELECT DISTINCT rank FROM Faculty` | Find the list of distinct ranks for faculty. | `['SELECT', 'DISTINCT', 'rank', 'FROM', 'Faculty']` | `['select', 'distinct', 'rank', 'from', 'faculty']` | `['Find', 'the', 'list', 'of', 'distinct', 'ran...` | `create table Activity (\n actid INTEGER PRIMA...` | You are an agent designed to interact with a S... |
| 4 | activity_1 | `SELECT DISTINCT building FROM Faculty` | Show all the distinct buildings that have facu... | `['SELECT', 'DISTINCT', 'building', 'FROM', 'Fa...` | `['select', 'distinct', 'building', 'from', 'fa...` | `['Show', 'all', 'the', 'distinct', 'buildings'...` | `create table Activity (\n actid INTEGER PRIMA...` | You are an agent designed to interact with a S... |


In [None]:
# def combine_texts(question, answer):
#     return {
#         "text": f"###{question}@@@{answer}{EOS_TOKEN}",
#     }

# def load_data_from_csv(file_path):
#     try:
#         df = pd.read_csv(file_path)
#         return df['Question'].tolist(), df['Answer'].tolist()
#     except FileNotFoundError:
#         raise FileNotFoundError(f"File not found: {file_path}")
#     except pd.errors.EmptyDataError:
#         raise pd.errors.EmptyDataError(f"Empty file: {file_path}")
#     except KeyError as e:
#         raise KeyError(f"Missing column: {str(e)}")

# # Load data from CSV file
# questions, answers = load_data_from_csv(TRAINING_DATA_PATH)

# # Prepare the fine-tuning training dataset
# if questions and answers:
#     # Combine questions and answers with the instruction and EOS_TOKEN
#     # EOS_TOKEN prevents infinite generation during inference
#     combined_texts = [combine_texts(question, answer) for question, answer in zip(questions, answers)]

#     # Create the fine-tuning dataset
#     dataset = Dataset.from_dict({"text": [ct["text"] for ct in combined_texts]})

#     # Check if the dataset is not empty before accessing its first element
#     if len(dataset) > 0:
#         print("Example training record:\n")
#         print(dataset[0]['text'])
#     else:
#         print("The fine-tuning dataset is empty.")
# else:
#     print("Failed to create the fine-tuning dataset.")


Collecting pyarrow
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.9/39.9 MB 50.7 MB/s eta 0:00:00
Installing collected packages: pyarrow
  Attempting uninstall: pyarrow
    Found existing installation: pyarrow 12.0.1
    Uninstalling pyarrow-12.0.1:
      Successfully uninstalled pyarrow-12.0.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 24.4.1 requires pyarrow<15.0.0a0,>=14.0.1, but you have pyarrow 17.0.0 which is incompatible.
ibis-framework 8.0.0 requires pyarrow<16,>=2, but you have pyarrow 17.0.0 which is incompatible.
Successfully installed pyarrow-17.0.0


In [None]:
SQL_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    schemas = examples["db_schema"]
    questions = examples["question"]
    queries = examples["query"]
    texts = []
    # Distinct loop names avoid shadowing the column lists and the built-in input()
    for instruction, schema, question, query in zip(instructions, schemas, questions, queries):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        # Schemas are capped at 15,000 characters to bound the prompt length.
        text = SQL_prompt.format(instruction, schema[:15000], question, query) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

dataset = Dataset.from_pandas(df_schema_train)
dataset = dataset.map(formatting_prompts_func, batched = True,)

Map:   0%|          | 0/635 [00:00<?, ? examples/s]

In [None]:
dataset['text'][0]
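
Every training example follows the template above, so prompts at inference time will probably work best when they use the same layout with the response slot left empty and no EOS token appended. A minimal sketch, assuming a hypothetical helper `build_inference_prompt` (not part of the notebook) and an illustrative one-table schema:

```python
# Same template as SQL_prompt above, repeated so this sketch is self-contained.
SQL_PROMPT_TEMPLATE = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}
{}

### Input:
{}

### Response:
{}"""


def build_inference_prompt(instruction, schema, question):
    """Hypothetical helper: fill the template but leave the response empty."""
    return SQL_PROMPT_TEMPLATE.format(instruction, schema, question, "")


example_prompt = build_inference_prompt(
    "You are an agent designed to interact with a SQL database.",
    "create table Faculty (FacID INTEGER PRIMARY KEY, Rank TEXT)",  # illustrative schema
    "How many faculty do we have?",
)
print(example_prompt)
```

The resulting string ends right after `### Response:`, which is where the model is expected to continue with the SQL query.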

<a name="Analyze"></a>
### Analyze the training dataset and find the largest example

Analyze the training dataset and auto-calculate `max_seq_length`, that is, how much space is needed to process the largest example. Examples shorter than `max_seq_length` must be padded with special tokens, which we want to minimize. Here's why:

1. The presence of excessive padding can hinder the model's ability to learn meaningful representations of the input text. The model may struggle to capture the important features and patterns in the input when a significant portion of it consists of padding tokens.

2. If a large portion of your input sequences consist of padding tokens, it can limit the effective capacity of your model. The model's attention mechanism and other computations will be applied to the padding tokens as well, which can reduce the model's ability to focus on and learn from the meaningful parts of the input.

3. When there is excessive padding, the model may start to learn patterns or dependencies related to the padding tokens rather than the actual content of the input. This can lead to overfitting, where the model becomes overly reliant on the padding and fails to generalize well to new, unseen data.

4. Longer sequences require more memory and computation time during training and inference. The model would spend a significant portion of the computation during training on processing padding tokens. This can lead to slower training and reduced computational efficiency, as the model is essentially processing a large number of meaningless tokens.
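
To make the cost in point 4 concrete, the share of padding can be estimated from token lengths alone. A minimal sketch, assuming every example is padded to a fixed `max_seq_length`; `padding_fraction` and the sample lengths are illustrative and not taken from the dataset:

```python
def padding_fraction(token_lengths, max_seq_length):
    """Fraction of token slots filled with padding when every example
    is padded (or truncated) to max_seq_length."""
    total_slots = len(token_lengths) * max_seq_length
    real_tokens = sum(min(n, max_seq_length) for n in token_lengths)
    return (total_slots - real_tokens) / total_slots


# Illustrative lengths: most examples are short, one is long.
lengths = [100, 150, 200, 1000]
print(f"{padding_fraction(lengths, 1024):.1%} of the token slots are padding")
```

A single long example can force a large `max_seq_length`, so it is worth checking the length distribution, not only the maximum.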



In [None]:
# prompt: For text in dataset find the max word count

max_word_count = 0
for text in dataset['text']:
  word_count = len(text.split())
  if word_count > max_word_count:
    max_word_count = word_count

print(f"Maximum word count in the dataset: {max_word_count}")
max_seq_length = max_word_count * 2

Maximum word count in the dataset: 2192


In [None]:
max_instruction = 0
max_schema = 0
max_input = 0
max_output = 0
max_total = 0
max_length = 1024

instructions = df_schema_train["instruction"]
schemas = df_schema_train["db_schema"]
inputs = df_schema_train["question"]
outputs = df_schema_train["query"]
texts = []
for instruction, schema, input, output in tqdm(zip(instructions, schemas, inputs, outputs)):
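    # Count the number of tokens in the input data fields.
    # Note: because max_length is set, the tokenizer truncates by default,
    # so every count below is capped at max_length (1024).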
    instruction_tokens = tokenizer.encode_plus(instruction, add_special_tokens=False, max_length=max_length)["input_ids"]
    schema_tokens = tokenizer.encode_plus(schema, add_special_tokens=False, max_length=max_length)["input_ids"]
    input_tokens = tokenizer.encode_plus(input, add_special_tokens=False, max_length=max_length)["input_ids"]
    output_tokens = tokenizer.encode_plus(output, add_special_tokens=False, max_length=max_length)["input_ids"]
    total_tokens = tokenizer.encode_plus(f"{instruction} {schema} {input} {output}", add_special_tokens=False, max_length=max_length)["input_ids"]

    max_instruction = max(max_instruction, len(instruction_tokens))
    max_schema = max(max_schema, len(schema_tokens))
    max_input = max(max_input, len(input_tokens))
    max_output = max(max_output, len(output_tokens))
    max_total = max(max_total, len(total_tokens))

# Display the results
print(f"\nMaximum token counts:")
print(f"Instruction: {max_instruction}")
print(f"Schema: {max_schema}")
print(f"Input: {max_input}")
print(f"Output: {max_output}")
print(f"Total: {max_total}")
#
# Add a small buffer to the maximum token count
buffer = 10
# max_seq_length can be set up to 2x the default context length
# of the base model because Unsloth supports RoPE Scaling internally.
# Here, we auto-configure this length based on input data analysis.
max_seq_length = max_total + buffer

# Display the table header
table_title = "Training Data Token Counts"
print(f"\n{table_title:-^70}")
print(f"{'Measure':<14}{'Instruction':<14}{'Schema':<14}{'Input':<14}{'Output':<14}{'Total':<14}")

# Display token counts in tabular form
print(f"{'Maximums':<14}{max_instruction:<14}{max_schema:<14}{max_input:<14}{max_output:<14}{max_total:<14}")
print(f"{'Max Seq Len':<14}{'':<14}{'':<14}{'':<14}{'':<14}{max_seq_length:<14}\n")

print(f"Set max_seq_length in FastLanguageModel to {max_seq_length} to handle the maximum number of tokens required by the input training data (Total Maximum + Buffer).")



0it [00:00, ?it/s]Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
635it [00:02, 266.06it/s]


Maximum token counts:
Instruction: 37
Schema: 1024
Input: 48
Output: 119
Total: 1024

----------------------Training Data Token Counts----------------------
Measure       Instruction   Schema        Input         Output        Total         
Maximums      37            1024          48            119           1024          
Max Seq Len                                                           1034          

Set max_seq_length in FastLanguageModel to 1034 to handle the maximum number of tokens required by the input training data (Total Maximum + Buffer).
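
Because `max_length=1024` makes the tokenizer truncate, the Schema and Total counts above are capped at 1024, and the measurement uses the raw fields rather than the formatted prompt. A sketch of an uncapped measurement follows; `max_token_length` is a hypothetical helper, and `str.split` stands in for the tokenizer so the block runs on its own:

```python
def max_token_length(texts, encode):
    """Return the length of the longest encoded text.

    `encode` is any callable that maps a string to a sequence of tokens.
    """
    return max((len(encode(text)) for text in texts), default=0)


# Stand-in data and tokenizer so the sketch is self-contained.
sample_texts = ["SELECT count(*) FROM Faculty", "SELECT DISTINCT rank FROM Faculty"]
longest = max_token_length(sample_texts, str.split)
print(f"Longest example: {longest} tokens")

# In the notebook, the real tokenizer could be passed instead (assumption:
# called without max_length, so nothing is truncated):
# longest = max_token_length(
#     dataset["text"],
#     lambda text: tokenizer(text, add_special_tokens=False)["input_ids"],
# )
```

Measuring the formatted `text` column also accounts for the prompt template and the EOS token, which the per-field counts leave out.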





In [None]:
#@title Show pre-training GPU memory stats
import torch

gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.564 GB.
0.0 GB of memory reserved.


In [None]:
#@title Load a 4-bit quantized base model for training
from unsloth import FastLanguageModel
import torch


dtype = None # None for auto detection. Bfloat16 for Ampere+. Float16 for Tesla T4 & V100.
load_in_4bit = True # Use 4-bit quantization to reduce memory usage. Can be False.

# Supported 4-bit pre-quantized models for 4x faster downloading and out-of-memory avoidance.
# Find more models at https://huggingface.co/unsloth
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",
    "unsloth/gemma-7b-it-bnb-4bit",
    "unsloth/gemma-2b-bnb-4bit",
    "unsloth/gemma-2b-it-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",
]

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # add a Hugging Face access token if using a private or gated model
)

==((====))==  Unsloth 2024.9: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/198 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

<a name="PEFT"></a>
### Convert the pre-trained base model into a PEFT model

Add LoRA adaptation modules to prepare the model for fine-tuning. With LoRA layers, we only need to update a small fraction of the parameters compared with fine-tuning the base model directly: with the settings below, about 42 million trainable parameters, roughly 0.5% of the 8-billion-parameter base model.

PEFT = parameter-efficient fine-tuning.

`r` is the LoRA rank. It sets the size of the low-rank adapter matrices and therefore affects the number of trainable parameters, the amount of compute required, model quality and potential overfitting. A higher value may capture more complex patterns and transformations, while adding more parameters.

`lora_alpha` is a scaling parameter that controls how strongly the LoRA update is weighted relative to the base model's weights: the update is scaled by `lora_alpha / r` (or `lora_alpha / sqrt(r)` with rank-stabilized LoRA). A higher value gives the LoRA parameters a larger influence on the model's output, but it does not add parameters.

`r` and `lora_alpha` interact through this scaling factor, while only `r` (together with the choice of `target_modules`) determines the number of additional parameters introduced by LoRA. Experiment with these parameters to determine what works best to train your dataset.
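
As a sanity check on how `r` drives adapter size, the trainable-parameter count can be worked out by hand: each adapted linear layer with `in` inputs and `out` outputs gains `r × (in + out)` parameters. The sketch below assumes the published Llama 3 8B dimensions (hidden size 4096, MLP size 14336, 32 layers, and 1024-wide key/value projections from grouped-query attention); `lora_param_count` is an illustrative helper, not an Unsloth or PEFT function.

```python
# Assumed Llama 3 8B layer shapes: (input features, output features).
HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS = 4096, 1024, 14336, 32
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_param_count(r, shapes=TARGET_SHAPES, num_layers=NUM_LAYERS):
    """LoRA adds A (r x in) and B (out x r) to each adapted layer."""
    per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


for rank in (8, 16, 32):
    print(f"r = {rank:>2}: {lora_param_count(rank):,} trainable parameters")
```

With `r = 16` this gives 41,943,040, which should match the trainable-parameter count Unsloth reports when training starts. `lora_alpha` does not appear in the formula.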

In [None]:
# #Applying rope scaling factor

# # Set the RoPE scaling factor to 1.904 as recommended by kaiokendev
# #rope_scaling_factor = 1.904
# rope_scaling_factor = 1.0
# def apply_rope_scaling(model, scaling_factor):
#     for module in model.modules():
#         if hasattr(module, 'rotary_emb'):
#             # Adjust rotary embedding scaling
#             module.rotary_emb.scaling_factor = scaling_factor
#             print(f"Set RoPE scaling to: {scaling_factor}")

# # Apply the scaling factor to the model
# apply_rope_scaling(model, rope_scaling_factor)

In [None]:
from unsloth import FastLanguageModel

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.9 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


### Test the model before fine-tuning to note the results

Try inference on the model before fine-tuning to show that the model does not know about any of the facts in the dataset.

Uncomment only one line at a time from the question list below and run the cell.

In [None]:
from unsloth import FastLanguageModel
max_context_length = 1024
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    """PRAGMA foreign_keys = ON;

CREATE TABLE "people" (
"People_ID" int,
"Name" text,
"Country" text,
"Is_Male" text,
"Age" int,
PRIMARY KEY ("People_ID")
);





CREATE TABLE "church" (
"Church_ID" int,
"Name" text,
"Organized_by" text,
"Open_Date" int,
"Continuation_of" text,
PRIMARY KEY ("Church_ID")
);




CREATE TABLE "wedding" (
"Church_ID" int,
"Male_ID" int,
"Female_ID" int,
"Year" int,
PRIMARY KEY ("Church_ID","Male_ID","Female_ID"),
FOREIGN KEY ("Church_ID") REFERENCES `church`("Church_ID"),
FOREIGN KEY ("Male_ID") REFERENCES `people`("People_ID"),
FOREIGN KEY ("Female_ID") REFERENCES `people`("People_ID")
);
Q:
Can you give me a table with Church name and year, male_id, and female id from the wedding
A:"""
    # "Where is Cosmic Fusion Dynamics headquartered?"
    # "Who is the current CEO of Cosmic Fusion Dynamics?"
    # "What is the name of Cosmic Fusion Dynamics' flagship product?"
    # "What award did Cosmic Fusion Dynamics earn in 2021?"
    # "What does Cosmic Fusion Dynamics specialize in?"
    # "Describe FinanceAI from Cosmic Fusion Dynamics."
    # "How much Series A funding did Cosmic Fusion Dynamics receive?"
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64)
decoded_output = tokenizer.batch_decode(outputs)
print(decoded_output)

['<|begin_of_text|>PRAGMA foreign_keys = ON;\n\nCREATE TABLE "people" (\n"People_ID" int,\n"Name" text,\n"Country" text,\n"Is_Male" text,\n"Age" int,\nPRIMARY KEY ("People_ID")\n);\n\n\n\n\n\nCREATE TABLE "church" (\n"Church_ID" int,\n"Name" text,\n"Organized_by" text,\n"Open_Date" int,\n"Continuation_of" text,\nPRIMARY KEY ("Church_ID")\n);\n\n\n\n\nCREATE TABLE "wedding" (\n"Church_ID" int,\n"Male_ID" int,\n"Female_ID" int,\n"Year" int,\nPRIMARY KEY ("Church_ID","Male_ID","Female_ID"),\nFOREIGN KEY ("Church_ID") REFERENCES `church`("Church_ID"),\nFOREIGN KEY ("Male_ID") REFERENCES `people`("People_ID"),\nFOREIGN KEY ("Female_ID") REFERENCES `people`("People_ID")\n);\nQ:\nCan you give me a table with Church name and year, male_id, and female id from the wedding\nA: SELECT church.Name, wedding.Year, wedding.Male_ID, wedding.Female_ID FROM church JOIN wedding ON church.Church_ID = wedding.Church_ID;\n\n\n\n\n\nQ:\nCan you give me a table with the name of the person, country, and age of 

In [None]:
#@title Check the numerical precision supported by the GPU
print (f"GPU supports {'brain' if torch.cuda.is_bf16_supported() else 'half-precision'} floating-point.")

GPU supports brain floating-point.


### Configure the Supervised Fine-Tuning (SFT) Trainer

Next, let's configure Hugging Face's TRL `SFTTrainer`.

* `max_seq_length` can be larger than the model's context limit since Unsloth uses automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.

* Experiment with the number of training iterations to determine what you need to achieve a good training result. Specify either `num_train_epochs` or `max_steps` as a parameter below; if both are given, `max_steps` overrides `num_train_epochs`. For example, set `num_train_epochs=4` for four full passes through the training data, or set `max_steps=80` to run 80 optimizer steps, each on one effective batch of examples.

* Each step trains on a number of examples equal to GPU count $\times$ `per_device_train_batch_size` $\times$ `gradient_accumulation_steps`. So, for example, with `per_device_train_batch_size` of 3, `gradient_accumulation_steps` of 2, and 1 GPU, the effective batch size would be 6 (3 $\times$ 2 $\times$ 1). A 30-record dataset would take 5 steps to complete (30 $\div$ 6). If you train for 80 `max_steps`, you would run through all the data 16 times (80 $\div$ 5), or 16 epochs. The dataset is reshuffled at the start of each epoch.

* More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer).
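
The arithmetic in the bullets above can be checked with a few lines of Python. This is a sketch with an illustrative helper, `training_schedule`, that is not part of TRL; it assumes the last partial batch of an epoch still counts as a step:

```python
import math


def training_schedule(num_examples, per_device_batch_size,
                      gradient_accumulation_steps, num_gpus, max_steps):
    """Return (effective batch size, steps per epoch, epochs covered)."""
    effective_batch = num_gpus * per_device_batch_size * gradient_accumulation_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    epochs = max_steps / steps_per_epoch
    return effective_batch, steps_per_epoch, epochs


# The 30-record example from the text: batch size 3, accumulation 2, 1 GPU.
print(training_schedule(30, 3, 2, 1, max_steps=80))  # (6, 5, 16.0)

# The configuration used in this notebook: 635 examples, batch size 1, accumulation 2.
batch, steps, epochs = training_schedule(635, 1, 2, 1, max_steps=80)
print(f"{batch} examples per step, {steps} steps per epoch, {epochs:.2f} epochs")
```

For this notebook's settings, 80 steps cover roughly a quarter of one pass over the 635 examples, so most records are never seen unless `max_steps` is raised or `num_train_epochs` is used instead.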


In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    #max_seq_length = max_seq_length,
    max_seq_length = 1024,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 2,
        warmup_steps = 3,
        max_steps = 80,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3406,
        output_dir = "outputs",
        gradient_checkpointing=True,
    ),
)

model.gradient_checkpointing_enable()

Map (num_proc=2):   0%|          | 0/635 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs


### Train the model

Train for the selected number of steps. Look for `Training Loss` to follow a decreasing trend.

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 635 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 2
\        /    Total batch size = 2 | Total steps = 80
 "-____-"     Number of trainable parameters = 41,943,040


| Step | Training Loss |
|---|---|
| 1 | 2.1309 |
| 2 | 1.9719 |
| 3 | 1.3743 |
| 4 | 1.4969 |
| 5 | 1.5985 |
| 6 | 1.3252 |
| 7 | 0.6969 |
| 8 | 1.218 |
| 9 | 0.7194 |
| 10 | 0.6452 |
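
Per-step losses are noisy (step 8 is higher than step 7, for example), so a short moving average can make the trend easier to judge. A minimal sketch using the ten logged values above; the window of 3 is an arbitrary choice:

```python
def moving_average(values, window):
    """Simple moving average; returns len(values) - window + 1 points."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Training losses logged for steps 1-10 above.
losses = [2.1309, 1.9719, 1.3743, 1.4969, 1.5985,
          1.3252, 0.6969, 1.218, 0.7194, 0.6452]
smoothed = moving_average(losses, window=3)
print([round(x, 3) for x in smoothed])
```

If the smoothed values stop decreasing over many steps, that may be a sign to revisit the learning rate or the number of steps.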


In [None]:
#@title Show post-training GPU memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

76.1901 seconds used for training.
1.27 minutes used for training.
Peak reserved memory = 7.254 GB.
Peak reserved memory for training = 7.254 GB.
Peak reserved memory % of max memory = 18.335 %.
Peak reserved memory for training % of max memory = 18.335 %.


<a name="Inference"></a>
### Inference: Generate output token-by-token

Uncomment only one line at a time from the question list below and run the cell.

Output will be generated token-by-token with TextStreamer for continuous inference.

In [None]:
from unsloth import FastLanguageModel
# max_context_length = 2048  # Adjust based on your model's capacity
# model.config.max_position_embeddings = max_context_length
# FastLanguageModel.for_inference(model, max_seq_len=max_context_length)


FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    """PRAGMA foreign_keys = ON;

CREATE TABLE "people" (
"People_ID" int,
"Name" text,
"Country" text,
"Is_Male" text,
"Age" int,
PRIMARY KEY ("People_ID")
);





CREATE TABLE "church" (
"Church_ID" int,
"Name" text,
"Organized_by" text,
"Open_Date" int,
"Continuation_of" text,
PRIMARY KEY ("Church_ID")
);




CREATE TABLE "wedding" (
"Church_ID" int,
"Male_ID" int,
"Female_ID" int,
"Year" int,
PRIMARY KEY ("Church_ID","Male_ID","Female_ID"),
FOREIGN KEY ("Church_ID") REFERENCES `church`("Church_ID"),
FOREIGN KEY ("Male_ID") REFERENCES `people`("People_ID"),
FOREIGN KEY ("Female_ID") REFERENCES `people`("People_ID")
);
Q:
Can you give me a table with Church name and year, male_id, and female id from the wedding
A:"""
    # "Where is Cosmic Fusion Dynamics headquartered?"
    # "Who is the current CEO of Cosmic Fusion Dynamics?"
    # "What is the name of Cosmic Fusion Dynamics' flagship product?"
    # "What award did Cosmic Fusion Dynamics earn?"
    # "Who founded Cosmic Fusion Dynamics?"
    # "What does Cosmic Fusion Dynamics specialize in?"
    # "Describe FinanceAI from Cosmic Fusion Dynamics."
    # "How much Series A funding did Cosmic Fusion Dynamics receive?"
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 64)

<|begin_of_text|>PRAGMA foreign_keys = ON;

CREATE TABLE "people" (
"People_ID" int,
"Name" text,
"Country" text,
"Is_Male" text,
"Age" int,
PRIMARY KEY ("People_ID")
);





CREATE TABLE "church" (
"Church_ID" int,
"Name" text,
"Organized_by" text,
"Open_Date" int,
"Continuation_of" text,
PRIMARY KEY ("Church_ID")
);




CREATE TABLE "wedding" (
"Church_ID" int,
"Male_ID" int,
"Female_ID" int,
"Year" int,
PRIMARY KEY ("Church_ID","Male_ID","Female_ID"),
FOREIGN KEY ("Church_ID") REFERENCES `church`("Church_ID"),
FOREIGN KEY ("Male_ID") REFERENCES `people`("People_ID"),
FOREIGN KEY ("Female_ID") REFERENCES `people`("People_ID")
);
Q:
Can you give me a table with Church name and year, male_id, and female id from the wedding
A:SELECT c.Name, w.Year, p1.People_ID, p2.People_ID FROM church c INNER JOIN wedding w ON c.Church_ID = w.Church_ID INNER JOIN people p1 ON w.Male_ID = p1.People_ID INNER JOIN people p2 ON w.Female_ID = p2


### Inference: Generate output all-at-once

Uncomment only one line at a time from the question list below and run the cell.

Output will be generated all-at-once.

In [None]:
from unsloth import FastLanguageModel
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    "Who founded Cosmic Fusion Dynamics?"
    # "Where is Cosmic Fusion Dynamics headquartered?"
    # "Who is the current CEO of Cosmic Fusion Dynamics?"
    # "What is the name of Cosmic Fusion Dynamics' flagship product?"
    # "What award did Cosmic Fusion Dynamics earn in 2021?"
    # "What does Cosmic Fusion Dynamics specialize in?"
    # "Describe FinanceAI from Cosmic Fusion Dynamics."
    # "How much Series A funding did Cosmic Fusion Dynamics receive?"
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64)
decoded_output = tokenizer.batch_decode(outputs)
print(decoded_output)

['<|begin_of_text|>Who founded Cosmic Fusion Dynamics? Cosmic Fusion Dynamics is a new startup company that is building a new space propulsion system. The company was founded in 2016 by Dr. Kevin Parkin, who is a former NASA astronaut. The company’s mission is to develop a new type of space propulsion system that is more efficient and less expensive than traditional rocket engines']


### Inference: Generate output all-at-once with post-processing

Uncomment only one line at a time from the question list below and run the cell.

Output will be generated all-at-once, then post-processed.

Post-processing strips the begin and end tokens, then returns only the text that follows the `A:` delimiter, producing clean output.

In [None]:
import re

def extract_answer(text):
    # Remove the begin and end tokens
    text = re.sub(r'<\|begin_of_text\|>|<\|end_of_text\|>', '', text)
    # Split the text on the "A:" delimiter that ends the prompt
    parts = re.split(r'A:', text)
    # Return the result
    return parts[1].strip() if len(parts) == 2 else text.strip()

from unsloth import FastLanguageModel
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    # "Who founded Cosmic Fusion Dynamics?"
    # "Where is Cosmic Fusion Dynamics headquartered?"
    """PRAGMA foreign_keys = ON;

CREATE TABLE "people" (
"People_ID" int,
"Name" text,
"Country" text,
"Is_Male" text,
"Age" int,
PRIMARY KEY ("People_ID")
);





CREATE TABLE "church" (
"Church_ID" int,
"Name" text,
"Organized_by" text,
"Open_Date" int,
"Continuation_of" text,
PRIMARY KEY ("Church_ID")
);




CREATE TABLE "wedding" (
"Church_ID" int,
"Male_ID" int,
"Female_ID" int,
"Year" int,
PRIMARY KEY ("Church_ID","Male_ID","Female_ID"),
FOREIGN KEY ("Church_ID") REFERENCES `church`("Church_ID"),
FOREIGN KEY ("Male_ID") REFERENCES `people`("People_ID"),
FOREIGN KEY ("Female_ID") REFERENCES `people`("People_ID")
);
Q:
Can you give me a table with Church name and year, male_id, and female id from the wedding
A:"""
    # "What is the name of Cosmic Fusion Dynamics' flagship product?"
    # "What award did Cosmic Fusion Dynamics earn in 2021?"
    # "What does Cosmic Fusion Dynamics specialize in?"
    # "Describe FinanceAI from Cosmic Fusion Dynamics."
    # "How much Series A funding did Cosmic Fusion Dynamics receive?"
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 128)
decoded_output = tokenizer.batch_decode(outputs)
# Post-process the output
print(extract_answer(decoded_output[0]))

SELECT c.name, w.year, w.male_id, w.female_id FROM church c JOIN wedding w ON c.church_id = w.church_id
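
The `extract_answer()` helper above falls back to the full text whenever `A:` appears more than once, for example if the model itself generates another `A:`. A possible variant, sketched here as an assumption rather than the notebook's method, splits only on the first delimiter using `str.partition`. The function name `extract_after_first_delimiter` and the sample string are hypothetical.

```python
import re

def extract_after_first_delimiter(text, delimiter="A:"):
    """Strip begin/end tokens and return everything after the first delimiter."""
    text = re.sub(r"<\|begin_of_text\|>|<\|end_of_text\|>", "", text)
    # partition() splits once, so later occurrences stay inside the answer
    _, found, answer = text.partition(delimiter)
    # If the delimiter is missing, return the cleaned text unchanged
    return answer.strip() if found else text.strip()

sample = "<|begin_of_text|>Q:\nList church names\nA:SELECT Name FROM church<|end_of_text|>"
print(extract_after_first_delimiter(sample))
```

Whether this behavior is preferable depends on the prompt format: it assumes `A:` does not occur anywhere in the schema or the question.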


### Save only the LoRA adapters locally to Colab

This cell runs only when its guard is `True`. If the first line reads `if False:`, change `False` to `True` before running it.

**[NOTE]** This ONLY saves the LoRA adapters, not the full model. To save the full model, see GGUF and vLLM options below!

**[NOTE]** Colab deletes local models when your Colab session ends.

In [None]:
if True:
    local_model_name = "llama3-8b-cosmic-fusion-dynamics-lora"
    model.save_pretrained(local_model_name)
    tokenizer.save_pretrained(local_model_name)
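
Because Colab deletes local files when the session ends, it can help to bundle the saved adapter folder into a single archive that is easy to download. The sketch below uses only the standard library; `archive_directory` is a hypothetical helper, and the demo folder with its placeholder file stands in for the real adapter directory.

```python
import shutil
import tempfile
from pathlib import Path

def archive_directory(directory):
    """Zip `directory` next to itself and return the path of the archive."""
    directory = Path(directory)
    archive = shutil.make_archive(str(directory), "zip", root_dir=directory)
    return Path(archive)

# Demo on a throwaway folder containing a placeholder file
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "demo-lora"
    demo_dir.mkdir()
    (demo_dir / "adapter_config.json").write_text("{}")
    zip_path = archive_directory(demo_dir)
    archive_created = zip_path.exists()
    print(zip_path.name, archive_created)
```

In this notebook the call would be `archive_directory(local_model_name)`. Inside Colab, `files.download()` from `google.colab` can then fetch the resulting zip file.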

<a name="Token"></a>
### Create a Hugging Face access token

If you'd like to upload your models to Hugging Face, you'll need a free account and an access token.

Here's how to get an access token from Hugging Face:

* Create a free [Hugging Face account](https://huggingface.co/join).
* Go to the [access tokens page](https://huggingface.co/settings/tokens).
* Click `New token`.
* Choose any name.
* For `Type`, select `Fine-grained custom`.
* Choose these fine-grained options under `Repos`:
  * Read access to contents of all repos under your personal namespace
  * Write access to contents/settings of all repos under your personal namespace
* Click `Generate a token`. Don't share it with anyone!
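
Once the token exists, it should stay out of the notebook source. The following sketch shows one way to look it up, trying a Colab Secret first and an environment variable second. `resolve_hf_token` is a hypothetical helper; the secret name matches the one used later in this notebook.

```python
import os

def resolve_hf_token(name="HUGGING_FACE_HUB_TOKEN"):
    """Return the token from a Colab Secret if available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
        token = userdata.get(name)
        if token:
            return token
    except Exception:
        # Not running in Colab, or the secret is missing or access was not granted
        pass
    return os.environ.get(name)

# Report only whether a token was found; never print the token itself
token_available = resolve_hf_token() is not None
print("Token found:", token_available)
```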

<a name="Save"></a>
### Save the LoRA adapters to Hugging Face

This cell runs only when its guard is `True`. If the first line reads `if False:`, change `False` to `True` before running it.

**[NOTE]** This ONLY saves the LoRA adapters, not the full model. To save the full model, see GGUF and vLLM options below!

**[NOTE]** This model will persist beyond your Colab session in Hugging Face.

Running the cell below will save the model as a public repo to your account on Hugging Face. To save it as a private repo, add the parameter `private=True` to both `push_to_hub()` calls.

Set `token` to your Hugging Face access token, or add a Colab Secret called `HUGGING_FACE_HUB_TOKEN`.

Create a Hugging Face [access token](#Token).

In [None]:
if True:
    from google.colab import userdata
    repo = "scott4ai/llama3-8b-cosmic-fusion-dynamics-lora"
    model.push_to_hub(repo, token=userdata.get('HUGGING_FACE_HUB_TOKEN'))
    tokenizer.push_to_hub(repo, token=userdata.get('HUGGING_FACE_HUB_TOKEN'))

README.md:   0%|          | 0.00/575 [00:00<?, ?B/s]

Saved model to https://huggingface.co/scott4ai/llama3-8b-cosmic-fusion-dynamics-lora
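
To keep the model and tokenizer uploads consistent, for instance when adding `private=True`, the two `push_to_hub()` calls can be wrapped in one function. This is a minimal sketch: `push_adapters` is a hypothetical helper, and `_Recorder` is a stand-in object that records calls so the example runs without uploading anything.

```python
def push_adapters(model, tokenizer, repo, token, private=False):
    """Push both objects to the same repo with the same visibility setting."""
    for obj in (model, tokenizer):
        obj.push_to_hub(repo, token=token, private=private)

class _Recorder:
    """Stand-in for a model or tokenizer; records calls instead of uploading."""
    def __init__(self):
        self.calls = []

    def push_to_hub(self, repo, **kwargs):
        self.calls.append((repo, kwargs))

fake_model, fake_tokenizer = _Recorder(), _Recorder()
push_adapters(fake_model, fake_tokenizer, "account/demo-lora",
              token="hf_dummy", private=True)
print(fake_model.calls)
```

With the real objects, the call would take `model`, `tokenizer`, the repo name, and the token read from `userdata.get('HUGGING_FACE_HUB_TOKEN')`.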


### Save in GGUF / llama.cpp format locally to Colab

Use `save_pretrained_gguf()` to save locally.

To save to `GGUF` / `llama.cpp`, Unsloth clones `llama.cpp` and saves the model in `q8_0` format by default.

See all the supported quantization methods on the [Unsloth Wiki](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options). Here are some options:
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
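
To get a feel for the trade-off between these options, file size can be estimated as parameter count times bits per weight. The sketch below is rough arithmetic only: the bits-per-weight figures are approximate assumptions (they are not taken from this notebook), the parameter count is rounded to 8 billion, and file metadata is ignored.

```python
# Approximate average bits per weight for each method (assumed values)
APPROX_BITS_PER_WEIGHT = {
    "f16": 16.0,
    "q8_0": 8.5,
    "q5_k_m": 5.7,
    "q4_k_m": 4.85,
}

def estimate_gguf_gb(num_params, method):
    """Estimate file size in GB (10^9 bytes) for a given quantization method."""
    bits = APPROX_BITS_PER_WEIGHT[method]
    return num_params * bits / 8 / 1e9

for method in APPROX_BITS_PER_WEIGHT:
    size_gb = estimate_gguf_gb(8.0e9, method)
    print(f"{method:7s} roughly {size_gb:.1f} GB")
```

Actual file sizes will differ somewhat, since the k-quant methods mix tensor types as described in the list above.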

Host GGUF models for inference in `llama.cpp`, Ollama, or a UI-based system like [GPT4All](https://gpt4all.io/index.html).

In [None]:
from google.colab import userdata

model_name = "llama3-8b-cosmic-fusion-dynamics-f16-gguf"
# Save to 16-bit GGUF
if False: model.save_pretrained_gguf(model_name, tokenizer, quantization_method = "f16")

model_name = "llama3-8b-cosmic-fusion-dynamics-gguf"
# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf(model_name, tokenizer, quantization_method = "q4_k_m")

model_name = "llama3-8b-cosmic-fusion-dynamics-q8-gguf"
# Save by default to 8-bit q8_0
if False: model.save_pretrained_gguf(model_name, tokenizer)

### Save in GGUF / llama.cpp format to Hugging Face

Use `push_to_hub_gguf()` to save to Hugging Face.

Create a Hugging Face [access token](#Token).

Host GGUF models for inference in `llama.cpp`, Ollama, or a UI-based system like [GPT4All](https://gpt4all.io/index.html).

In [None]:
from google.colab import userdata

model_name = "scott4ai/llama3-8b-cosmic-fusion-dynamics-f16-gguf"
# Save to 16-bit GGUF
if False: model.push_to_hub_gguf(model_name, tokenizer, quantization_method = "f16", token = userdata.get('HUGGING_FACE_HUB_TOKEN'))

model_name = "scott4ai/llama3-8b-cosmic-fusion-dynamics-gguf"
# Save to q4_k_m GGUF
if False: model.push_to_hub_gguf(model_name, tokenizer, quantization_method = "q4_k_m", token = userdata.get('HUGGING_FACE_HUB_TOKEN'))

model_name = "scott4ai/llama3-8b-cosmic-fusion-dynamics-q8-gguf"
# Save by default to 8-bit q8_0
if False: model.push_to_hub_gguf(model_name, tokenizer, token = userdata.get('HUGGING_FACE_HUB_TOKEN'))

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 60.93 out of 83.48 RAM for saving.


100%|██████████| 32/32 [00:00<00:00, 45.14it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GUUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to f16 will take 20 minutes.
 "-____-"     In total, you will have to wait around 26 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...


Unsloth: We must use f16 for non Llama and Mistral models.


Unsloth: [1] Converting model at scott4ai/llama3-8b-cosmic-fusion-dynamics-f16-gguf into f16 GGUF format.
The output location will be ./scott4ai/llama3-8b-cosmic-fusion-dynamics-f16-gguf-unsloth.F16.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: llama3-8b-cosmic-fusion-dynamics-f16-gguf
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 8192
INFO:hf-to-gguf:gguf: embedding length = 4096
INFO:hf-to-gguf:gguf: feed forward length = 14336
INFO:hf-to-gguf:gguf: head count = 32
INFO:hf-to-gguf:gguf: key-value head count = 8
INFO:hf-to-gguf:gguf: rope theta = 500000.0
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-05
INFO:hf-to-gguf:gguf: file type = 1
INFO:hf-to-gguf:Set model tokenizer
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:gguf.vocab:Adding 280147 merge(s).
INFO:gguf.vocab:Setting special token type 

### Save in float16 vLLM format locally to Colab

You can save in these formats:
* `float16` with `merged_16bit`
* `int4` with `merged_4bit`
* LoRA adapters with `lora`

Use `save_pretrained_merged()` to save locally.


In [None]:
from google.colab import userdata

model_name = "llama3-8b-cosmic-fusion-dynamics-merged_16bit-vllm"
# Merge to 16-bit
if False: model.save_pretrained_merged(model_name, tokenizer, save_method = "merged_16bit")

model_name = "llama3-8b-cosmic-fusion-dynamics-merged_4bit-vllm"
# Merge to 4-bit
if False: model.save_pretrained_merged(model_name, tokenizer, save_method = "merged_4bit")

model_name = "llama3-8b-cosmic-fusion-dynamics-lora-vllm"
# Just LoRA adapters
if False: model.save_pretrained_merged(model_name, tokenizer, save_method = "lora")

Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 5.7G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 55.49 out of 83.48 RAM for saving.


100%|██████████| 32/32 [00:00<00:00, 57.72it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.


### Save in float16 vLLM format to Hugging Face

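Use `push_to_hub_merged()` to upload to Hugging Face.

**[NOTE]** With `save_method = "merged_4bit"`, Unsloth raises the `RuntimeError` shown in the output below. Per that message, treat the 4-bit merge as a final step and switch to `merged_4bit_forced` only if you are certain.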

Create a Hugging Face [access token](#Token).

In [None]:
from google.colab import userdata

model_name = "pinkpeach/llama3-8b-cosmic-fusion-dynamics-merged_16bit-vllm"
# Merge to 16-bit
if True: model.push_to_hub_merged(model_name, tokenizer, save_method = "merged_16bit", token = userdata.get('HUGGING_FACE_HUB_TOKEN'))

model_name = "pinkpeach/llama3-8b-cosmic-fusion-dynamics-merged_4bit-vllm"
# Merge to 4-bit
if True: model.push_to_hub_merged(model_name, tokenizer, save_method = "merged_4bit", token = userdata.get('HUGGING_FACE_HUB_TOKEN'))

model_name = "pinkpeach/llama3-8b-cosmic-fusion-dynamics-lora-vllm"
# Just LoRA adapters
if True: model.push_to_hub_merged(model_name, tokenizer, save_method = "lora", token = userdata.get('HUGGING_FACE_HUB_TOKEN'))

Unsloth: You are pushing to hub, but you passed your HF username = pinkpeach.
We shall truncate pinkpeach/llama3-8b-cosmic-fusion-dynamics-merged_16bit-vllm to llama3-8b-cosmic-fusion-dynamics-merged_16bit-vllm
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 5.7G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 54.75 out of 83.48 RAM for saving.


100%|██████████| 32/32 [00:00<00:00, 62.69it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...


README.md:   0%|          | 0.00/576 [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

Upload 4 LFS files:   0%|          | 0/4 [00:00<?, ?it/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

Done.
Saved merged model to https://huggingface.co/pinkpeach/llama3-8b-cosmic-fusion-dynamics-merged_16bit-vllm


RuntimeError: Unsloth: Merging into 4bit will cause your model to lose accuracy if you plan
to merge to GGUF or others later on. I suggest you to do this as a final step
if you're planning to do multiple saves.
If you are certain, change `save_method` to `merged_4bit_forced`.

<a name="Load"></a>
### Load a full model or LoRA adapters

To run this cell, first change `False` to `True` and uncomment the appropriate line for what you want to load.

**[NOTE]** You should restart the runtime session to clear out previous model data first, and redo the necessary steps.

To load the model from Colab locally, omit the Hugging Face account prefix before the `model_name`. To load from Hugging Face, add your account prefix to the `model_name`, like "account/model".

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        # load a model from the local Colab environment
        # model_name = "llama3-8b-cosmic-fusion-dynamics-lora"

        # load a model from Hugging Face
        model_name = "scott4ai/llama3-8b-cosmic-fusion-dynamics-lora",

        # use HF access token for private or gated models
        # (needs `from google.colab import userdata` after a runtime restart)
        # token = userdata.get('HUGGING_FACE_HUB_TOKEN'),
    )

    # Run a quick inference test on the model
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference
    inputs = tokenizer(
    [
        "Who founded Cosmic Fusion Dynamics?"
    ], return_tensors = "pt").to("cuda")

    outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = False)
    decoded_output = tokenizer.batch_decode(outputs)
    print(decoded_output)
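
The rule described for this cell (a bare name for a local Colab folder, `account/model` for Hugging Face) can be sketched as a small helper. `resolve_model_name` and its parameters are hypothetical; the demo uses a temporary folder in place of a real saved model.

```python
import tempfile
from pathlib import Path

def resolve_model_name(name, account=None, base_dir="."):
    """Prefer a local folder called `name`; otherwise fall back to 'account/name'."""
    local = Path(base_dir) / name
    if local.is_dir():
        return str(local)
    if account is None:
        raise ValueError(f"No local folder {name!r} and no account given")
    return f"{account}/{name}"

# Demo: one model folder exists locally, the other does not
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "demo-lora").mkdir()
    local_choice = resolve_model_name("demo-lora", base_dir=tmp)
    hub_choice = resolve_model_name("other-lora", account="account", base_dir=tmp)
    local_expected = str(Path(tmp) / "demo-lora")
    print(hub_choice)
```

The returned string would then be passed as `model_name` to `FastLanguageModel.from_pretrained()`.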

Try [inference](#Inference) on the model.

And we're done!

If you have any questions, find bugs, or want to stay updated with the latest LLM stuff, join Unsloth on [Discord](https://discord.gg/u54VK8m8tk).
