# AI Motivational Quote Generator

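This notebook fine-tunes TinyLlama-1.1B-Chat with PEFT (LoRA on a 4-bit quantized base) to generate a motivational quote for a given topic keyword, then merges the adapters and exports the model to GGUF for CPU inference.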



### Install Dependencies

In [1]:
print("Installing dependencies...")
!pip install -q transformers datasets accelerate peft trl bitsandbytes huggingface_hub llama-cpp-python
print("✅ Dependencies installed.")

Installing dependencies...
✅ Dependencies installed.


### Import Libraries

In [24]:
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer, SFTConfig
from huggingface_hub import notebook_login, HfApi, upload_file
from llama_cpp import Llama
import os

print("✅ Libraries imported.")

✅ Libraries imported.


### Load Dataset and Preprocessing

In [3]:
print("Loading and preprocessing dataset...")
# Load the dataset
dataset = load_dataset("Abirate/english_quotes", split="train")

# Preprocessing function
# We format the data into a "prompt" structure:
# "Keyword: [TAG]\nQuote: [QUOTE]"
def format_dataset(example):
    # Skip examples whose 'tags' list is missing or empty
    if example['tags']:
        # Get the first tag as our keyword
        keyword = example['tags'][0]
        quote = example['quote']

        # Create the formatted string
        return {"text": f"Keyword: {keyword}\nQuote: {quote}"}
    else:
        # If no tags, we can't use this example
        return {"text": None}

# Apply the function and filter out the None entries
processed_dataset = dataset.map(format_dataset)
processed_dataset = processed_dataset.filter(lambda x: x['text'] is not None)

print("--- Sample of Processed Data ---")
print(processed_dataset[0]['text'])
print("---------------------------------")
print("✅ Dataset ready.")

Loading and preprocessing dataset...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


--- Sample of Processed Data ---
Keyword: be-yourself
Quote: “Be yourself; everyone else is already taken.”
---------------------------------
✅ Dataset ready.
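
The formatting rule above can be checked without downloading anything. The following is a minimal sketch on two made-up records (they are not rows from the dataset), and `to_prompt` is a hypothetical standalone helper that mirrors `format_dataset` but returns the string directly:

```python
# Toy records that mimic the 'quote' and 'tags' columns used above.
# These rows are invented for illustration only.
toy_rows = [
    {"quote": "Keep going.", "tags": ["perseverance", "life"]},
    {"quote": "No tags here.", "tags": []},
]

def to_prompt(example):
    """Return the 'Keyword/Quote' training string, or None when there is no tag."""
    if not example["tags"]:
        return None
    # Only the first tag is used as the keyword, as in format_dataset.
    return f"Keyword: {example['tags'][0]}\nQuote: {example['quote']}"

formatted = [to_prompt(row) for row in toy_rows]
kept = [text for text in formatted if text is not None]
print(kept)
```

Rows without tags are dropped, and every remaining row ends up in the same two-line layout that the inference prompt reuses later.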


### Configure 4-bit Quantization (QLoRA)

In [4]:
# Load the model in 4-bit precision to fit on Colab GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

print("✅ 4-bit config created.")

✅ 4-bit config created.
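
To see why 4-bit loading helps on a small GPU, here is a rough back-of-the-envelope estimate of weight storage. It assumes a nominal 1.1 billion parameters (taken from the model name) and ignores quantization constants, layers kept in higher precision, activations, and optimizer state, so treat the numbers as a lower bound rather than a measurement:

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate storage for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

params = 1.1e9  # nominal parameter count (assumption based on the "1.1B" label)

fp16_gib = weight_memory_gib(params, 16)
nf4_gib = weight_memory_gib(params, 4)

print(f"float16 weights: ~{fp16_gib:.2f} GiB")
print(f"4-bit weights:   ~{nf4_gib:.2f} GiB")
```

The 4x reduction applies to the frozen base weights only; the LoRA adapter weights are trained in higher precision on top of them.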


### Load TinyLlama Model and Tokenizer

In [5]:
print("Loading TinyLlama model and tokenizer...")
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Load the model with our 4-bit config
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto", # Automatically puts the model on the GPU
    trust_remote_code=True,
)
model.config.use_cache = False # Recommended for training

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

print("✅ Model and tokenizer loaded.")

Loading TinyLlama model and tokenizer...
✅ Model and tokenizer loaded.


### Configure PEFT (LoRA)

In [6]:
# We only train a small set of "adapter" weights
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
print("✅ LoRA config created.")

✅ LoRA config created.
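
A quick way to see how small the adapter is: each targeted linear layer gets two low-rank matrices, A of shape (r, in_features) and B of shape (out_features, r). The sketch below counts those parameters. The hidden size (2048) and block count (22) match the GGUF metadata printed later in this notebook; the `v_proj` output width of 256 is an assumption about TinyLlama's grouped-query attention and is not confirmed anywhere in this notebook.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA pair: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)

r = 16
lora_alpha = 32

hidden = 2048     # embedding length reported in the GGUF metadata
n_layers = 22     # block count reported in the GGUF metadata
kv_dim = 256      # ASSUMPTION: v_proj output width (4 KV heads x 64 dims)

per_layer = lora_params(hidden, hidden, r) + lora_params(hidden, kv_dim, r)
total_trainable = per_layer * n_layers
scaling = lora_alpha / r  # factor applied to the low-rank update B @ A

print(f"Trainable LoRA parameters: {total_trainable:,}")
print(f"LoRA scaling factor: {scaling}")
```

Under these assumptions the adapter is a fraction of a percent of the base model, which is why only `q_proj` and `v_proj` updates need gradients and optimizer state.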


### Configure Training Arguments

In [13]:
training_args = SFTConfig(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    optim="paged_adamw_8bit",
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=True,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    logging_steps=25,
    save_strategy="epoch",
    dataset_text_field="text",       # The column name of our formatted text
    max_length=512,              # Max sequence length
    report_to="none"
)
print("✅ SFTConfig set.")

✅ SFTConfig set.
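
The batch settings above combine into an effective batch size, which in turn determines roughly how many optimizer steps one epoch takes. This is a simplified estimate, assuming a single GPU; `example_count` is a hypothetical placeholder, not the measured size of the filtered dataset, and the trainer's own rounding may differ by a step:

```python
import math

per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_devices = 1  # ASSUMPTION: a single Colab GPU

# Examples consumed per optimizer update.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices

def approx_optimizer_steps(num_examples: int, epochs: int = 1) -> int:
    """Rough number of optimizer updates for the given dataset size."""
    return math.ceil(num_examples / effective_batch) * epochs

example_count = 2500  # hypothetical dataset size, for illustration only
print(f"Effective batch size: {effective_batch}")
print(f"Approximate steps per epoch: {approx_optimizer_steps(example_count)}")
```

Raising `gradient_accumulation_steps` increases the effective batch without increasing peak memory, at the cost of fewer updates per epoch.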


### Initialize the SFTTrainer

In [14]:
trainer = SFTTrainer(
    model=model,
    train_dataset=processed_dataset,
    peft_config=peft_config,
    processing_class=tokenizer,
    args=training_args,
)
print("✅ SFTTrainer initialized.")

✅ SFTTrainer initialized.


### Start Training

In [15]:
print("🚀 Starting model fine-tuning...")
trainer.train()
print("✅ Training complete!")

🚀 Starting model fine-tuning...




| Step | Training Loss |
|------|---------------|
| 25   | 2.5045        |
| 50   | 2.1085        |
| 75   | 1.927         |
| 100  | 1.9656        |
| 125  | 1.9317        |
| 150  | 1.9363        |


✅ Training complete!
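
The logged losses are easier to interpret as perplexity. Assuming the reported training loss is the mean per-token cross-entropy in nats (the usual convention for causal language model training), perplexity is simply `exp(loss)`. The values below are copied from the training log above:

```python
import math

# Step -> training loss, copied from the log above.
logged_losses = {25: 2.5045, 50: 2.1085, 75: 1.927, 100: 1.9656, 125: 1.9317, 150: 1.9363}

# Perplexity is the exponential of the mean cross-entropy (in nats).
perplexity = {step: math.exp(loss) for step, loss in logged_losses.items()}

for step, ppl in perplexity.items():
    print(f"step {step:>3}: loss {logged_losses[step]:.4f} -> perplexity {ppl:.2f}")
```

The loss drops quickly in the first 75 steps and then flattens, which suggests that most of the adaptation to the "Keyword/Quote" format happens early in the single epoch.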


### Merge LoRA Adapters and Save Full Model

In [25]:
# This combines the base TinyLlama model with our trained LoRA adapters
# into a single, fine-tuned model for conversion.
print("Saving trained adapters...")
adapter_dir = "trained_adapters"
trainer.save_model(adapter_dir) # Save the LoRA adapters

# --- De-quantization Step ---

# 1. Clear the 4-bit model and trainer from memory
del model
del trainer
torch.cuda.empty_cache()
print("Cleared 4-bit model from memory.")

# 2. Reload the base model in float16
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
print(f"Reloading base model {model_name} in float16...")
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # Load unquantized float16 weights (recent transformers releases rename this argument to `dtype`)
    device_map="auto",
    trust_remote_code=True,
)

# 3. Load the adapters onto this new float16 model
print(f"Loading adapters from {adapter_dir}...")
merged_model = PeftModel.from_pretrained(base_model, adapter_dir)

# 4. Merge the adapters into the float16 model
print("Merging adapters...")
merged_model = merged_model.merge_and_unload()

# 5. Save the final, de-quantized model
merged_model_dir = "merged_model"
print(f"Saving merged float16 model to {merged_model_dir}...")
merged_model.save_pretrained(merged_model_dir)

# 6. Save the tokenizer (we must reload it)
print(f"Saving tokenizer to {merged_model_dir}...")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.save_pretrained(merged_model_dir)

print(f"✅ Merged float16 model saved to '{merged_model_dir}' directory.")

Saving trained adapters...
Cleared 4-bit model from memory.
Reloading base model TinyLlama/TinyLlama-1.1B-Chat-v1.0 in float16...


`torch_dtype` is deprecated! Use `dtype` instead!


Loading adapters from trained_adapters...
Merging adapters...




Saving merged float16 model to merged_model...
Saving tokenizer to merged_model...
✅ Merged float16 model saved to 'merged_model' directory.
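
Conceptually, merging folds the low-rank update into the base weight: `W_merged = W + (lora_alpha / r) * B @ A`. The toy example below illustrates that identity with tiny hand-written matrices in plain Python. It is a sketch of the arithmetic only, not PEFT's implementation, and all matrices are made up:

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    """Multiply every entry of a matrix by a scalar."""
    return [[x * s for x in row] for row in a]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (out=2, in=2)
A = [[0.5, -0.5]]             # LoRA A, shape (r=1, in=2)
B = [[0.2], [0.4]]            # LoRA B, shape (out=2, r=1)
alpha, r = 2.0, 1
x = [[1.0], [2.0]]            # input as a column vector

# Merged weight: base plus the scaled low-rank update.
W_merged = add(W, scale(matmul(B, A), alpha / r))

# Output with a separate adapter vs. output with the merged weight.
y_adapter = add(matmul(W, x), scale(matmul(B, matmul(A, x)), alpha / r))
y_merged = matmul(W_merged, x)

print(y_adapter, y_merged)
```

Both paths give the same output, which is why the merged model can be exported as an ordinary checkpoint with no PEFT dependency at inference time.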


### Clone and Build llama.cpp

In [22]:
# 1. Remove the old, failed directory
print("Removing old 'llama.cpp' directory...")
!rm -rf llama.cpp

# 2. Clone the repository fresh
print("Cloning llama.cpp repository...")
!git clone https://github.com/ggerganov/llama.cpp.git

# 3. Build with CMake
print("Building 'llama-quantize' executable with CMake...")
!cd llama.cpp && cmake -B build && cmake --build build --config Release

# 4. Check if the executable was built in its new location
import os
# The new path for the executable is inside 'build/bin/'
quantize_executable_path = "llama.cpp/build/bin/llama-quantize"

if os.path.exists(quantize_executable_path):
    print(f"✅ '{quantize_executable_path}' executable built successfully.")
else:
    print(f"❌ ERROR: '{quantize_executable_path}' not found after build. Check 'cmake' output for errors.")

Removing old 'llama.cpp' directory...
Cloning llama.cpp repository...
Cloning into 'llama.cpp'...
remote: Enumerating objects: 67212, done.
remote: Counting objects: 100% (303/303), done.
remote: Compressing objects: 100% (174/174), done.
remote: Total 67212 (delta 245), reused 129 (delta 129), pack-reused 66909 (from 4)
Receiving objects: 100% (67212/67212), 194.00 MiB | 15.82 MiB/s, done.
Resolving deltas: 100% (48805/48805), done.
Building 'llama-quantize' executable with CMake...
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features -

### Convert Merged Model to GGUF

In [26]:
# 1. Convert the merged HF model to an intermediate F16 GGUF.
# 2. Quantize the F16 GGUF to our target Q4_K_M.

intermediate_gguf = "model-f16.gguf"
output_gguf_file = "tinyllama-quotes-Q4_K_M.gguf"
merged_model_path = "merged_model"

# --- Step 1: Convert to F16 GGUF ---
print(f"Converting model to intermediate F16 GGUF: {intermediate_gguf}")
!python llama.cpp/convert_hf_to_gguf.py {merged_model_path} \
    --outfile {intermediate_gguf} \
    --outtype f16

if not os.path.exists(intermediate_gguf):
    print(f"❌ ERROR: Intermediate file {intermediate_gguf} was not created. Stopping.")
else:
    print("✅ Intermediate F16 model created.")

    # --- Step 2: Quantize to Q4_K_M ---
    print(f"Quantizing {intermediate_gguf} to {output_gguf_file}...")

    # We use the 'llama-quantize' executable built in the previous cell.
    # The format is: llama-quantize [INPUT_FILE] [OUTPUT_FILE] [QUANT_TYPE]
    !./llama.cpp/build/bin/llama-quantize {intermediate_gguf} {output_gguf_file} Q4_K_M

    # Check if the file was created successfully
    if os.path.exists(output_gguf_file):
        print(f"\n✅ Final GGUF model saved as '{output_gguf_file}'")
        # Check the file size
        print("--- GGUF File Details ---")
        !ls -lh {output_gguf_file}
        print("-------------------------")

        # Optional: Clean up the large intermediate file
        !rm {intermediate_gguf}
        print(f"Cleaned up intermediate file: {intermediate_gguf}")

    else:
        print(f"\n❌ ERROR: GGUF file creation failed. '{output_gguf_file}' was not found.")
        print("Please check the 'llama-quantize' command output above for errors.")

Converting model to intermediate F16 GGUF: model-f16.gguf
INFO:hf-to-gguf:Loading model: merged_model
INFO:hf-to-gguf:Model architecture: LlamaForCausalLM
INFO:hf-to-gguf:gguf: indexing model part 'model.safetensors'
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:output.weight,               torch.float16 --> F16, shape = {2048, 32000}
INFO:hf-to-gguf:token_embd.weight,           torch.float16 --> F16, shape = {2048, 32000}
INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.float16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.0.ffn_down.weight,       torch.float16 --> F16, shape = {5632, 2048}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,       torch.float16 --> F16, shape = {2048, 5632}
INFO:hf-to-gguf:blk.0.ffn_up.weight,         torch.float16 --> F16, shape = {2048, 5632}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,       torch.float16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.0.attn_k.weight,         torch.float16 --> F16
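
Beyond checking that the output file exists, a cheap sanity check is to read the GGUF header. The sketch below assumes the GGUF v2+ layout: the 4-byte magic `GGUF`, then a little-endian `uint32` version, a `uint64` tensor count, and a `uint64` metadata key-value count. `read_gguf_header` is a hypothetical helper, and the demo writes a synthetic 24-byte header rather than touching the real model; the values (version 3, 201 tensors, 32 metadata pairs) mirror what the llama.cpp loader reports for this model in the test cell.

```python
import os
import struct
import tempfile

def read_gguf_header(path):
    """Read the fixed-size GGUF header (assumes the v2+ layout with 64-bit counts)."""
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return {"version": version, "tensor_count": tensor_count, "metadata_kv_count": kv_count}

# Demo on a synthetic header so the example runs without a real model file.
with tempfile.NamedTemporaryFile(delete=False, suffix=".gguf") as tmp:
    tmp.write(b"GGUF" + struct.pack("<IQQ", 3, 201, 32))
    demo_path = tmp.name

demo_header = read_gguf_header(demo_path)
os.remove(demo_path)
print(demo_header)
```

Calling `read_gguf_header(output_gguf_file)` after quantization would catch a truncated or mislabeled file before it is loaded or uploaded.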

### Test GGUF Model Locally

In [30]:
# Before uploading, let's test the GGUF file we just created.
# This proves the final file works as expected.

print("🤖 Loading GGUF model for testing...")

# Load the GGUF model
# n_gpu_layers=0 means use CPU only, which is how it will run on Spaces.
# This is a good way to double-check.
llm = Llama(
    model_path=output_gguf_file,
    n_ctx=512,      # Context window
    n_threads=os.cpu_count(),
    n_gpu_layers=0  # Use 0 to test on CPU
)

print("✅ GGUF model loaded. Generating test quote...")

# Create the prompt in the same format we trained on
test_keyword = "life"
prompt = f"Keyword: {test_keyword}\nQuote:"

# Generate the completion
output = llm.create_completion(
    prompt,
    max_tokens=80,
    temperature=0.7,
    stop=["\n", "Keyword:"], # Stop at a newline or if it tries to start a new entry
    echo=False # Don't print our prompt back to us
)

generated_text = output["choices"][0]["text"].strip()

print("\n--- 📝 GGUF TEST RESULT ---")
print(f"Keyword: {test_keyword}")
print(f"Generated Quote: {generated_text}")
print("----------------------------")
print("✅ Test complete. If the quote looks good, proceed to the next cell!")

llama_model_loader: loaded meta data with 32 key-value pairs and 201 tensors from tinyllama-quotes-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Merged_Model
llama_model_loader: - kv   3:                         general.size_label str              = 1.1B
llama_model_loader: - kv   4:                          llama.block_count u32              = 22
llama_model_loader: - kv   5:                       llama.context_length u32              = 2048
llama_model_loader: - kv   6:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv   7:                  llama.feed_forward_length u32 

🤖 Loading GGUF model for testing...


load_tensors:   CPU_REPACK model buffer size =   455.06 MiB
load_tensors:   CPU_Mapped model buffer size =   629.99 MiB
repack: repack tensor blk.0.attn_q.weight with q4_K_8x8
repack: repack tensor blk.0.attn_k.weight with q4_K_8x8
repack: repack tensor blk.0.attn_output.weight with q4_K_8x8
repack: repack tensor blk.0.ffn_gate.weight with q4_K_8x8
.repack: repack tensor blk.0.ffn_up.weight with q4_K_8x8
.repack: repack tensor blk.1.attn_q.weight with q4_K_8x8
.repack: repack tensor blk.1.attn_k.weight with q4_K_8x8
repack: repack tensor blk.1.attn_output.weight with q4_K_8x8
repack: repack tensor blk.1.ffn_gate.weight with q4_K_8x8
.repack: repack tensor blk.1.ffn_up.weight with q4_K_8x8
.repack: repack tensor blk.2.attn_q.weight with q4_K_8x8
repack: repack tensor blk.2.attn_k.weight with q4_K_8x8
repack: repack tensor blk.2.attn_v.weight with q4_K_8x8
repack: repack tensor blk.2.attn_output.weight with q4_K_8x8
.repack: repack tensor blk.2.ffn_gate.weight with q4_K_8x8
.repack: repa

✅ GGUF model loaded. Generating test quote...


llama_perf_context_print:        load time =     785.43 ms
llama_perf_context_print: prompt eval time =     785.27 ms /     9 tokens (   87.25 ms per token,    11.46 tokens per second)
llama_perf_context_print:        eval time =    6033.36 ms /    54 runs   (  111.73 ms per token,     8.95 tokens per second)
llama_perf_context_print:       total time =    6850.01 ms /    63 tokens
llama_perf_context_print:    graphs reused =         52



--- 📝 GGUF TEST RESULT ---
Keyword: life
Generated Quote: “The only way to do great work is to love what you do. If you haven't found it yet, keep looking. Don't settle. As with all matters of the heart, you'll know when you find it.” - Steve Jobs
----------------------------
✅ Test complete. If the quote looks good, proceed to the next cell!
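
Because the model only saw the "Keyword/Quote" layout during training, the inference prompt has to match it exactly. The helpers below are a hypothetical sketch: `build_prompt` centralizes the template, and `apply_stops` mimics in plain Python what the `stop` argument does, which can be handy when post-processing raw text from a backend that lacks stop sequences. The sample `raw_output` string is invented:

```python
def build_prompt(keyword: str) -> str:
    """Build the inference prompt in the same format used for training."""
    return f"Keyword: {keyword.strip()}\nQuote:"

def apply_stops(text: str, stops=("\n", "Keyword:")) -> str:
    """Cut text at the earliest stop sequence and trim surrounding whitespace."""
    cut = len(text)
    for stop in stops:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut].strip()

prompt = build_prompt(" life ")
raw_output = " Keep moving forward.\nKeyword: love\nQuote: ..."  # invented sample
clean_quote = apply_stops(raw_output)

print(repr(prompt))
print(clean_quote)
```

Keeping the template in one function avoids subtle mismatches, such as a missing newline or trailing space, between training data and prompts.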
