# Evaluating Synthetic Data for LLM Fine-Tuning

This notebook tests whether **synthetic data is effective for LLM fine-tuning** by comparing models trained on real and synthetic data on the same task.

Key steps:
- Fine-tune an **LLM on synthetic data** (`generated_sequences_no_dp.jsonl`).
- Evaluate model performance on **classification and language modeling tasks**.
- Compare models trained on **real vs. synthetic datasets**.
- Assess whether synthetic data can replace real data **without significant performance loss**.

The result indicates whether **synthetic data is a viable replacement for real data in model training**.
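
A minimal sketch of the comparison the steps above work toward, assuming each model's evaluation yields a list of (actual, predicted) rating pairs. The names `accuracy` and `accuracy_gap` and the sample pairs are hypothetical; the numbers are made up for illustration and are not results from this notebook.

```python
from typing import List, Optional, Tuple

Pair = Tuple[int, Optional[int]]  # (actual rating, parsed prediction or None)


def accuracy(pairs: List[Pair]) -> float:
    """Fraction of pairs whose prediction equals the actual rating."""
    if not pairs:
        return 0.0
    return sum(1 for actual, pred in pairs if pred == actual) / len(pairs)


def accuracy_gap(real_pairs: List[Pair], synthetic_pairs: List[Pair]) -> float:
    """Accuracy of the real-data model minus accuracy of the synthetic-data model."""
    return accuracy(real_pairs) - accuracy(synthetic_pairs)


# Made-up illustration data, not measurements from this notebook.
real_model = [(5, 5), (4, 4), (1, 1), (3, 2)]
synthetic_model = [(5, 5), (4, 5), (1, 1), (3, None)]
print(f"gap = {accuracy_gap(real_model, synthetic_model):.2f}")
```

A small gap on a shared test set would support the replacement claim; how small counts as "without significant performance loss" is a judgment this sketch leaves open.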


In [None]:
!pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
!pip install -q datasets bitsandbytes einops wandb
!pip install unsloth

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


# Model training for DP and non-DP fine-tuned data

In [None]:
import json
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer
from unsloth import is_bfloat16_supported
from peft import LoraConfig, get_peft_model
from unsloth import FastLanguageModel
import torch


max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True

# --- Step 1: Load the Llama-3 8B model and tokenizer in 4-bit mode ---
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

# Add a distinct pad token if not already set or if it matches the EOS token.
if tokenizer.pad_token is None or tokenizer.pad_token == tokenizer.eos_token:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})


Please restructure your imports with 'import unsloth' at the top of your file.
  from unsloth import is_bfloat16_supported


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.3.9: Fast Llama patching. Transformers: 4.49.0.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.168 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [None]:
# --- Step 2: Attach LoRA adapters for fine-tuning ---
model = FastLanguageModel.get_peft_model(
    model,
    r = 8, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)


Unsloth 2025.3.9 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
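
The rank `r = 8` and the seven target modules together fix the size of the adapter. The sketch below works through that arithmetic. It assumes the commonly published Llama-3-8B dimensions (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads of dimension 128), none of which are stated in this notebook, and it assumes no bias parameters, matching `bias = "none"`.

```python
# Assumed Llama-3-8B dimensions (not stated in this notebook).
HIDDEN = 4096          # model width
INTERMEDIATE = 14336   # MLP width
KV_DIM = 8 * 128       # 8 key/value heads x head dimension 128
NUM_LAYERS = 32
RANK = 8               # r passed to get_peft_model

# (input features, output features) of each targeted linear layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(rank: int) -> int:
    """Each targeted layer gains A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (fan_in + fan_out)
                    for fan_in, fan_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS


print(f"{lora_params(RANK):,}")  # 20,971,520
```

Under these assumptions the count agrees with the "Trainable parameters = 20,971,520" line in the training banner, and doubling `r` would double it.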


In [None]:
import json
from datasets import Dataset

# --- Step 3: Process the JSONL data to build prompts ---
file_path = "/content/generated_sequences_no_dp.jsonl"
max_records = 10000
prompts = []

with open(file_path, "r") as f:
    for j, line in enumerate(f):
        if j >= max_records:
            break
        record = json.loads(line.strip())
        gen_text = record.get("generated_text", "")
        gen_text = gen_text.strip('"')  # Remove extra outer quotes if present

        # Split the generated text using " | " as delimiter
        parts = gen_text.split(" | ")
        rating = None
        review = None
        for part in parts:
            part = part.strip()
            if part.startswith("Review Rating:"):
                rating = part.split(":", 1)[1].strip()
            elif part.startswith("Review:"):
                review = part.split(":", 1)[1].strip()
        # Only add if both rating and review are found
        if rating is not None and review is not None:
            prompt = f"""You are an expert classifier tasked with determining the rating of an Amazon product review. For each review provided, assign a numerical rating from 1 to 5, where 1 is a very negative review and 5 is a very positive review.

### Input:
{review}

### Rating:
{rating}"""
            # Append the end-of-text token
            prompt = prompt + " <|end_of_text|>"
            prompts.append(prompt)

# Create a dataset with a "text" field as required by SFTTrainer.
data_dict = {"text": prompts}
dataset = Dataset.from_dict(data_dict)


In [None]:
dataset[3]

{'text': "You are an expert classifier tasked with determining the rating of an Amazon product review. For each review provided, assign a numerical rating from 1 to 5, where 1 is a very negative review and 5 is a very positive review.\n\n### Input:\nWent to return this item, but changed my mind. Thought it may have been over kill for a business trip. I was wrong. On a full charge, both phone and pack, I had power all day and didn't put it back on my charger till morning, but I bet I probably could have stretched it out till noon before my phone would have died.<br />Beyond glad I kept it.<br />I did drop my phone a few times and the battery popped off but there was no real damage and it just popped back on without issue and kept working.<br /><br />One con. Dont get it to close to your laptop. Not sure how, must me the magnet inside the pack, but it puts my computer into hibernate mode. Took me a few times to figure that the pack was doing it to my computer when I rested near my mouse 
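
The parsing step in the prompt-building cell can be isolated and tried on a single string. This is a sketch under the assumption that `generated_text` is a sequence of `Key: value` fields joined by `" | "`; the helper name `parse_generated_text` and the sample string (including its `Product Title` field) are hypothetical.

```python
from typing import Optional, Tuple


def parse_generated_text(gen_text: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (rating, review) extracted from a ' | '-delimited string."""
    rating = review = None
    for part in gen_text.strip('"').split(" | "):
        part = part.strip()
        if part.startswith("Review Rating:"):
            rating = part.split(":", 1)[1].strip()
        elif part.startswith("Review:"):
            review = part.split(":", 1)[1].strip()
    return rating, review


# Hypothetical sample; real records may carry different fields.
sample = '"Product Title: Phone Case | Review Rating: 4 | Review: Fits well."'
print(parse_generated_text(sample))
```

One consequence of splitting on `" | "` is that a review containing that exact delimiter would be cut at its first occurrence, which may be worth checking on the synthetic data.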

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 8,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 128,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # max_steps = 60,
        num_train_epochs = 4, # For longer training runs!
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Tokenizing to ["text"] (num_proc=8):   0%|          | 0/9998 [00:00<?, ? examples/s]

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 9,998 | Num Epochs = 4 | Total steps = 76
O^O/ \_/ \    Batch size per device = 128 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (128 x 4 x 1) = 512
 "-____-"     Trainable parameters = 20,971,520/4,561,571,840 (0.46% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
10,1.4995
20,1.0291
30,0.8793
40,0.9371
50,0.837
60,0.9139
70,0.8287
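
The banner reports 76 total steps for 9,998 examples, which can be reproduced by hand. The sketch below assumes the step-counting rule where micro-batches that do not fill a complete accumulation group are not counted as an optimizer step; other Transformers versions may count differently. The function name is hypothetical.

```python
import math


def total_optimizer_steps(num_examples: int, per_device_batch: int,
                          grad_accum: int, epochs: int,
                          num_devices: int = 1) -> int:
    """Estimate optimizer steps for a run with gradient accumulation."""
    # Micro-batches per epoch; the last one may be partially filled.
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    # Assumption: an incomplete accumulation group does not count as a step.
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(epochs * updates_per_epoch)


# 9,998 examples, batch 128, accumulation 4, 4 epochs (values from the run above).
print(total_optimizer_steps(9998, 128, 4, 4))  # 76 under the stated assumption
```

With `logging_steps = 10`, 76 steps give the seven logged loss values shown (steps 10 through 70).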


In [None]:
print("EOS token:", tokenizer.eos_token)
print("PAD token:", tokenizer.pad_token)
print("EOS token id:", tokenizer.eos_token_id)
print("PAD token id:", tokenizer.pad_token_id)

EOS token: <|end_of_text|>
PAD token: <|reserved_special_token_250|>
EOS token id: 128001
PAD token id: 128255


In [None]:
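#@title Show current memory stats
# Note: run this cell before trainer.train(). In this run the baseline already
# equals the peak (18.652 GB), so the "for training" figures below read 0.0.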
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA L4. Max memory = 22.168 GB.
18.652 GB of memory reserved.


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

10993.6196 seconds used for training.
183.23 minutes used for training.
Peak reserved memory = 18.652 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 84.139 %.
Peak reserved memory for training % of max memory = 0.0 %.
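
The 0.0 GB training figure follows directly from the arithmetic: the baseline and the peak are both 18.652 GB, which suggests the baseline was read after the peak had already been reached. A sketch of the same calculation as a pure function (the name `memory_report` is hypothetical):

```python
def memory_report(baseline_gb: float, peak_gb: float, total_gb: float) -> dict:
    """Summarize peak reserved memory and the share attributed to training."""
    training_gb = round(peak_gb - baseline_gb, 3)
    return {
        "peak_gb": peak_gb,
        "training_gb": training_gb,
        "peak_pct": round(peak_gb / total_gb * 100, 3),
        "training_pct": round(training_gb / total_gb * 100, 3),
    }


# Values from the output above: baseline equals peak, so the training share is zero.
report = memory_report(baseline_gb=18.652, peak_gb=18.652, total_gb=22.168)
print(report)
```

In a live run, reading `torch.cuda.max_memory_reserved()` before calling `trainer.train()` would give a baseline that makes the training share meaningful.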


In [None]:
dataset[0]

{'text': "You are an expert classifier tasked with determining the rating of an Amazon product review. For each review provided, assign a numerical rating from 1 to 5, where 1 is a very negative review and 5 is a very positive review.\n\n### Input:\nLooks and works great. It was a little little on the loose fitting side but now it's fine. I've dropped my phone quite a bit and my phone has come out fine. I have a tempered glass screen protector on it and I'm pretty sure that's what saved my phone. I don't think this case would have protected it. I'm not sure how well it would have protected the camera on the back of the phone. It is a little bit loose and I've had it come off a few times. I haven't had it fall off yet though. I would recommend this case. It's a great price for a cute case that gives you some protection. I would recommend getting a tempered glass screen protector if you don't have one already. It will give you more protection. I'm not sure how\n\n### Rating:\n4 <|end_of_

# Inference

In [None]:
import torch
from unsloth import FastLanguageModel
from transformers import AutoTokenizer

# --- Step 1: Load your fine-tuned model and tokenizer ---
# Here we assume you saved your fine-tuned model in the "outputs" directory.
model_name_or_path = "/content/outputs/checkpoint-76"  # Adjust if necessary
max_seq_length = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name_or_path,
    max_seq_length=max_seq_length,
    dtype=None,           # Auto-detect dtype
    load_in_4bit=True,     # Ensure consistency with your training config
)

# Set the model to evaluation mode.
model.eval()

# --- Step 2: Define a function for inference ---
def generate_rating(review: str, max_gen_length: int = 50) -> str:
    # Build a prompt with the review as the input; note the rating is left empty.
    prompt = f"""You are an expert classifier tasked with determining the rating of an Amazon product review. For each review provided, assign a numerical rating from 1 to 5, where 1 is a very negative review and 5 is a very positive review. Dont give anything other than the rating.

### Input:
{review}

### Rating:
"""
    # Tokenize the prompt and move inputs to the model's device.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

    # Generate output from the model.
    outputs = model.generate(
        input_ids,
        max_length=input_ids.shape[1] + max_gen_length,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,  # Token id used for padding.
        eos_token_id=tokenizer.eos_token_id,  # Generation stops when this token is produced.
    )
    # Decode the generated tokens into text.
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_text

# --- Step 3: Run inference on an example review ---
example_review = "The product quality was outstanding and the delivery was fast, but the packaging could be improved."
result = generate_rating(example_review)
print("Generated result:\n", result)




Generated result:
 You are an expert classifier tasked with determining the rating of an Amazon product review. For each review provided, assign a numerical rating from 1 to 5, where 1 is a very negative review and 5 is a very positive review. Dont give anything other than the rating.

### Input:
The product quality was outstanding and the delivery was fast, but the packaging could be improved.

### Rating:
5 
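
The inference function samples with `do_sample=True` and `temperature=0.7`, so the same review can receive different ratings on different runs. The sketch below shows how temperature reshapes a distribution before sampling: logits are divided by the temperature and then passed through softmax. The logits are hypothetical values for the five rating tokens, not model outputs.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the rating tokens "1" through "5".
logits = [0.1, 0.3, 1.0, 2.0, 2.4]
for t in (1.0, 0.7):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

A temperature below 1 concentrates probability on the top token but still leaves room for other ratings to be drawn. For a classification evaluation, greedy decoding (`do_sample=False`) is a common way to make results repeatable.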


In [None]:
dataset[0]

{'text': "You are an expert classifier tasked with determining the rating of an Amazon product review. For each review provided, assign a numerical rating from 1 to 5, where 1 is a very negative review and 5 is a very positive review.\n\n### Input:\nLooks and works great. It was a little little on the loose fitting side but now it's fine. I've dropped my phone quite a bit and my phone has come out fine. It's not too bulky and the pink is so cute. I'd buy this again if I needed to. The only thing is that it does not have a cut out for the fingerprint reader, so I can't use that feature. But I've gotten used to using the facial recognition. I'm not sure if it's the case or the phone but the sound is not as loud as it was before I got the case. But I'm not sure if that's the case or the phone. But I love this case. It's cute and it's very protective. I highly recommend this case. I'm very happy with it.\n\n### Rating:\n4 <|end_of_text|>"}

In [None]:
tokenizer.eos_token

'<|end_of_text|>'

In [None]:
tokenizer.pad_token

'<|reserved_special_token_250|>'

In [None]:
from unsloth import FastLanguageModel
from transformers import AutoTokenizer, TextStreamer

# Enable optimized inference
FastLanguageModel.for_inference(model)

# Build the review classification prompt.
review_text = "The product quality was good and the delivery was fast, but the packaging could be improved."
prompt = f"""You are an expert classifier tasked with determining the rating of an Amazon product review. For each review provided, assign a numerical rating from 1 to 5, where 1 is a very negative review and 5 is a very positive review. Provide only a single digit as your answer.

### Input:
{review_text}

### Rating:"""

# Tokenize the prompt; only input_ids are passed to generate() below.
input_ids = tokenizer(prompt, return_tensors="pt", padding="longest", truncation=True).input_ids.to("cuda")

# Set up a TextStreamer to print tokens as they are generated (the prompt is echoed too; pass skip_prompt=True to hide it)
text_streamer = TextStreamer(tokenizer)

# Generate the model's output, limiting generation to a few tokens
_ = model.generate(
    input_ids,
    streamer=text_streamer,
    max_new_tokens=10,
    pad_token_id=tokenizer.pad_token_id  # or tokenizer.eos_token_id if pad_token_id not set
)


<|begin_of_text|>You are an expert classifier tasked with determining the rating of an Amazon product review. For each review provided, assign a numerical rating from 1 to 5, where 1 is a very negative review and 5 is a very positive review. Provide only a single digit as your answer.

### Input:
The product quality was good and the delivery was fast, but the packaging could be improved.

### Rating: 3 <|end_of_text|>


In [None]:
# ##TRY THIS


# import torch
# import json
# from unsloth import FastLanguageModel
# from transformers import AutoTokenizer

# # --- Step 1: Load your fine-tuned model and tokenizer ---
# model_name_or_path = "/content/outputs/checkpoint-76"  # Adjust if necessary
# max_seq_length = 2048

# model, tokenizer = FastLanguageModel.from_pretrained(
#     model_name=model_name_or_path,
#     max_seq_length=max_seq_length,
#     dtype=None,           # Auto-detect dtype
#     load_in_4bit=True,     # Ensure consistency with your training config
# )

# # Set the model to evaluation mode.
# model.eval()

# # --- Step 2: Define a function for inference ---
# def generate_rating(review: str, max_gen_length: int = 50) -> str:
#     # Build a prompt with the review as the input; note the rating is left empty.
#     prompt = f"""You are an expert classifier tasked with determining the rating of an Amazon product review. For each review provided, assign a numerical rating from 1 to 5, where 1 is a very negative review and 5 is a very positive review. Dont give anything other than the rating.

# ### Input:
# {review}

# ### Rating:
# """
#     # Tokenize the prompt and move inputs to the model's device.
#     input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

#     # Generate output from the model.
#     outputs = model.generate(
#         input_ids,
#         max_length=input_ids.shape[1] + max_gen_length,
#         do_sample=True,
#         temperature=0.7,
#         top_p=0.95,
#         pad_token_id=tokenizer.pad_token_id,  # Ensure generation stops correctly.
#         eos_token_id=tokenizer.eos_token_id,
#     )
#     # Decode the generated tokens into text.
#     generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
#     return generated_text

# # --- Step 3: Read reviews from test.jsonl and perform inference ---
# test_file_path = "/content/test.jsonl"
# results = []

# with open(test_file_path, "r") as f:
#     for i, line in enumerate(f):
#         record = json.loads(line)
#         review = record.get("Review")
#         if review is not None:
#             rating_prediction = generate_rating(review)
#             results.append({
#                 "review": review,
#                 "generated_rating": rating_prediction
#             })
#             print(f"Record {i}:")
#             print("Review:", review)
#             print("Generated Rating:", rating_prediction)
#             print("-" * 50)


Streaming output truncated to the last 5000 lines.
Record 583:
Review: This was returned.
Generated Rating: You are an expert classifier tasked with determining the rating of an Amazon product review. For each review provided, assign a numerical rating from 1 to 5, where 1 is a very negative review and 5 is a very positive review. Dont give anything other than the rating.

### Input:
This was returned.

### Rating:
1 
--------------------------------------------------
Record 584:
Review: Bought for a neighbor, she got a new I phone.  She liked mine, which is red.  We both love the clasp that locks the covers.
Generated Rating: You are an expert classifier tasked with determining the rating of an Amazon product review. For each review provided, assign a numerical rating from 1 to 5, where 1 is a very negative review and 5 is a very positive review. Dont give anything other than the rating.

### Input:
Bought for a neighbor, she got a new I phone.  She liked mine, which is 

KeyboardInterrupt: 

# No DP accuracy

In [None]:
import torch
import json
from unsloth import FastLanguageModel
from transformers import AutoTokenizer

# --- Step 1: Load your fine-tuned model and tokenizer ---
model_name_or_path = "/content/outputs/checkpoint-76"  # Adjust if necessary
max_seq_length = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name_or_path,
    max_seq_length=max_seq_length,
    dtype=None,           # Auto-detect dtype
    load_in_4bit=True,     # Ensure consistency with your training config
)

# Set the model to evaluation mode.
model.eval()

# --- Step 2: Define a function for inference ---
def generate_rating(review: str, max_gen_length: int = 50) -> str:
    # Build a prompt with the review as the input; note the rating is left empty.
    prompt = f"""You are an expert classifier tasked with determining the rating of an Amazon product review. For each review provided, assign a numerical rating from 1 to 5, where 1 is a very negative review and 5 is a very positive review. Dont give anything other than the rating.

### Input:
{review}

### Rating:
"""
    # Tokenize the prompt and move inputs to the model's device.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    prompt_length = input_ids.shape[1]

    # Generate output from the model.
    outputs = model.generate(
        input_ids,
        max_length=prompt_length + max_gen_length,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,  # Token id used for padding.
        eos_token_id=tokenizer.eos_token_id,  # Generation stops when this token is produced.
    )
    # Slice the output to only get tokens generated after the prompt.
    generated_tokens = outputs[0][prompt_length:]
    generated_response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
    return generated_response.strip()


# --- Step 3: Read reviews from test.jsonl, perform inference, and compute accuracy ---
test_file_path = "/content/test.jsonl"
results = []
num_correct = 0
total = 0
total_error = 0
with open(test_file_path, "r") as f:
    for i, line in enumerate(f):
        if i >= 1000:
            break
        record = json.loads(line)
        review = record.get("Review")
        actual_rating = record.get("Rating")
        if review is not None and actual_rating is not None:
            rating_prediction = generate_rating(review)
            # print(rating_prediction)
            # Parse the prediction as an integer; anything else counts as a parse error.
            try:
                predicted_rating = int(rating_prediction.strip())
            except ValueError:
                predicted_rating = None
                total_error += 1
            # Append to results
            results.append({
                "review": review,
                "actual_rating": actual_rating,
                "predicted_rating": predicted_rating,
                "raw_generated": rating_prediction
            })
            print(f"Record {i}:")
            print("Review:", review)
            print("Actual Rating:", actual_rating)
            print("Generated Rating (raw):", rating_prediction)
            print("Predicted Rating (parsed):", predicted_rating)
            print("-" * 50)
            if predicted_rating is not None and predicted_rating == actual_rating:
                num_correct += 1
            total += 1
print(f"Unparseable predictions: {total_error}")
accuracy = num_correct / total if total > 0 else 0
print(f"Accuracy over {total} records: {accuracy * 100:.2f}%")


Streaming output truncated to the last 5000 lines.
Record 167:
Review: MY J3 2018 LOVES THIS CASE. THE FOLDING COVER PROTECTS THE SCREEN, AND IT'S STILL SMALL ENOUGH TO BE CARRIED IN MY SHIRT POCKET IF NEEDED. A GREAT PRODUCT AND VALUE.
Actual Rating: 5
Generated Rating (raw): 5
Predicted Rating (parsed): 5
--------------------------------------------------
Record 168:
Review: This is the first time my screen protector saved my screen. I dropped my phone on concrete and the protector cracked diagonally in two spots. Buy my phone screen?  Not a blemish. After taking of the Ailun protector the phones screen was like new. Thanks. I am placing another order to have back ups.
Actual Rating: 5
Generated Rating (raw): 5
Predicted Rating (parsed): 5
--------------------------------------------------
Record 169:
Review: The water test was successful.  I put my iPhone 5s in it and it just does not seem a tight seal.  I can still squeeze it and get give from the case.  I am afraid t
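
The evaluation cell counts a prediction as unparseable whenever the generated text is not a bare integer, so an answer such as "4 stars" is scored as wrong. A more tolerant parser is sketched below. It assumes that taking the first standalone digit from 1 to 5 is acceptable for this task; `parse_rating` is a hypothetical helper, not part of the notebook.

```python
import re
from typing import Optional

# A single digit 1-5 that is not part of a longer number such as "10" or "45".
_RATING_RE = re.compile(r"(?<!\d)[1-5](?!\d)")


def parse_rating(generated: str) -> Optional[int]:
    """Return the first standalone rating digit in the text, or None."""
    match = _RATING_RE.search(generated)
    return int(match.group()) if match else None


for text in ["5", "4 stars", "Rating: 3\n\n### Input:", "great", "10"]:
    print(repr(text), "->", parse_rating(text))
```

Tolerant parsing changes what the accuracy figure measures, so it may be worth reporting the parse-failure count under both strict and tolerant rules.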

# DP accuracy

In [None]:
import torch
import json
from unsloth import FastLanguageModel
from transformers import AutoTokenizer

# --- Step 1: Load your fine-tuned model and tokenizer ---
model_name_or_path = "/content/outputs/1-checkpoint-76-with-dp"  # Adjust if necessary
max_seq_length = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name_or_path,
    max_seq_length=max_seq_length,
    dtype=None,           # Auto-detect dtype
    load_in_4bit=True,     # Ensure consistency with your training config
)

# Set the model to evaluation mode.
model.eval()

# --- Step 2: Define a function for inference ---
def generate_rating(review: str, max_gen_length: int = 50) -> str:
    # Build a prompt with the review as the input; note the rating is left empty.
    prompt = f"""You are an expert classifier tasked with determining the rating of an Amazon product review. For each review provided, assign a numerical rating from 1 to 5, where 1 is a very negative review and 5 is a very positive review. Dont give anything other than the rating.

### Input:
{review}

### Rating:
"""
    # Tokenize the prompt and move inputs to the model's device.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    prompt_length = input_ids.shape[1]

    # Generate output from the model.
    outputs = model.generate(
        input_ids,
        max_length=prompt_length + max_gen_length,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,  # Token id used for padding.
        eos_token_id=tokenizer.eos_token_id,  # Generation stops when this token is produced.
    )
    # Slice the output to only get tokens generated after the prompt.
    generated_tokens = outputs[0][prompt_length:]
    generated_response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
    return generated_response.strip()


# --- Step 3: Read reviews from test.jsonl, perform inference, and compute accuracy ---
test_file_path = "/content/test.jsonl"
results = []
num_correct = 0
total = 0
total_error = 0
with open(test_file_path, "r") as f:
    for i, line in enumerate(f):
        if i >= 1000:
            break
        record = json.loads(line)
        review = record.get("Review")
        actual_rating = record.get("Rating")
        if review is not None and actual_rating is not None:
            rating_prediction = generate_rating(review)
            # print(rating_prediction)
            # Parse the prediction as an integer; anything else counts as a parse error.
            try:
                predicted_rating = int(rating_prediction.strip())
            except ValueError:
                predicted_rating = None
                total_error += 1
            # Append to results
            results.append({
                "review": review,
                "actual_rating": actual_rating,
                "predicted_rating": predicted_rating,
                "raw_generated": rating_prediction
            })
            print(f"Record {i}:")
            print("Review:", review)
            print("Actual Rating:", actual_rating)
            print("Generated Rating (raw):", rating_prediction)
            print("Predicted Rating (parsed):", predicted_rating)
            print("-" * 50)
            if predicted_rating is not None and predicted_rating == actual_rating:
                num_correct += 1
            total += 1
print(f"Unparseable predictions: {total_error}")
accuracy = num_correct / total if total > 0 else 0
print(f"Accuracy over {total} records: {accuracy * 100:.2f}%")


Streaming output truncated to the last 5000 lines.
Record 167:
Review: MY J3 2018 LOVES THIS CASE. THE FOLDING COVER PROTECTS THE SCREEN, AND IT'S STILL SMALL ENOUGH TO BE CARRIED IN MY SHIRT POCKET IF NEEDED. A GREAT PRODUCT AND VALUE.
Actual Rating: 5
Generated Rating (raw): 5
Predicted Rating (parsed): 5
--------------------------------------------------
Record 168:
Review: This is the first time my screen protector saved my screen. I dropped my phone on concrete and the protector cracked diagonally in two spots. Buy my phone screen?  Not a blemish. After taking of the Ailun protector the phones screen was like new. Thanks. I am placing another order to have back ups.
Actual Rating: 5
Generated Rating (raw): 5
Predicted Rating (parsed): 5
--------------------------------------------------
Record 169:
Review: The water test was successful.  I put my iPhone 5s in it and it just does not seem a tight seal.  I can still squeeze it and get give from the case.  I am afraid t
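
The DP and non-DP evaluation cells differ only in the checkpoint path, so the loop could be shared. The sketch below factors it into one function that takes the predictor as an argument. It assumes each test line is a JSON object with `Review` and `Rating` keys, as in the cells above, and that `Rating` is integer-valued; `evaluate` and `fake_predict` are hypothetical names, and the stand-in predictor exists only so the example runs without a model.

```python
import json
from typing import Callable, Iterable, Optional


def evaluate(lines: Iterable[str],
             predict: Callable[[str], Optional[int]],
             limit: int = 1000) -> dict:
    """Score a predictor on JSONL lines; returns counts and accuracy."""
    total = correct = unparsed = 0
    for i, line in enumerate(lines):
        if i >= limit:
            break
        record = json.loads(line)
        review, actual = record.get("Review"), record.get("Rating")
        if review is None or actual is None:
            continue
        predicted = predict(review)
        total += 1
        if predicted is None:
            unparsed += 1
        elif predicted == int(actual):
            correct += 1
    accuracy = correct / total if total else 0.0
    return {"total": total, "correct": correct,
            "unparsed": unparsed, "accuracy": accuracy}


# Stand-in predictor for illustration only; a real run would wrap generate_rating.
def fake_predict(review: str) -> Optional[int]:
    return 5 if "great" in review.lower() else None


sample = [
    '{"Review": "Great case", "Rating": 5}',
    '{"Review": "Broke in a week", "Rating": 1}',
]
print(evaluate(sample, fake_predict))
```

Calling the same function once per checkpoint would keep the two accuracy figures comparable, since any change to parsing or limits applies to both.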

# Finetuning model on actual data

In [None]:
from unsloth import FastLanguageModel
import unsloth
from unsloth import is_bfloat16_supported
import json
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer
from peft import LoraConfig, get_peft_model
import torch


max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True

# --- Step 1: Load the Llama-3 8B model and tokenizer in 4-bit mode ---
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

# Add a distinct pad token if not already set or if it matches the EOS token.
if tokenizer.pad_token is None or tokenizer.pad_token == tokenizer.eos_token:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})



In [None]:
# --- Step 2: Attach LoRA adapters for fine-tuning ---
model = FastLanguageModel.get_peft_model(
    model,
    r = 8, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)




In [None]:
import json
from datasets import Dataset

# --- Step 3: Process the JSONL data to build prompts ---
file_path = "/content/train.jsonl"
max_records = 10000
prompts = []

with open(file_path, "r") as f:
    for j, line in enumerate(f):
        if j >= max_records:
            break
        record = json.loads(line.strip())
        # Extract only the review text and rating, ignoring other fields.
        review = record.get("Review", "").strip()
        rating = record.get("Rating")
        # Only add if both rating and review are found.
        if rating is not None and review:
            prompt = f"""You are an expert classifier tasked with determining the rating of an Amazon product review. For each review provided, assign a numerical rating from 1 to 5, where 1 is a very negative review and 5 is a very positive review.

### Input:
{review}

### Rating:
{rating}"""
            # Append the end-of-text token so the model learns to stop right after the rating.
            prompt = prompt + " <|end_of_text|>"
            prompts.append(prompt)

# Create a dataset with a "text" field as required by SFTTrainer.
data_dict = {"text": prompts}
dataset = Dataset.from_dict(data_dict)
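
The training prompt and the prompt used at evaluation time need to share the same template, otherwise the model is scored on a format it was not tuned on. One way to keep them aligned is a single builder used for both, sketched below. `build_prompt` and `INSTRUCTION` are hypothetical names, not part of this notebook, and the default `eos_token` simply mirrors the literal string used in the cell above.

```python
INSTRUCTION = (
    "You are an expert classifier tasked with determining the rating of an Amazon "
    "product review. For each review provided, assign a numerical rating from 1 to 5, "
    "where 1 is a very negative review and 5 is a very positive review."
)


def build_prompt(review, rating=None, eos_token="<|end_of_text|>"):
    """Build a training prompt (rating given) or an inference prompt (rating omitted)."""
    prompt = f"{INSTRUCTION}\n\n### Input:\n{review}\n\n### Rating:\n"
    if rating is not None:
        # Training example: append the label and the end-of-text marker.
        prompt += f"{rating} {eos_token}"
    return prompt


train_prompt = build_prompt("It works ok.", rating=4)
infer_prompt = build_prompt("It works ok.")
print(train_prompt)
```

With this design the inference prompt is always an exact prefix of the corresponding training prompt, so the only thing the model has to produce is the rating and the stop token.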


In [None]:
dataset[0]

{'text': 'You are an expert classifier tasked with determining the rating of an Amazon product review. For each review provided, assign a numerical rating from 1 to 5, where 1 is a very negative review and 5 is a very positive review.\n\n### Input:\nI bought this bc I thought it had the nice white background. Turns out it’s clear & since my phone is blue it doesn’t look anything like this.  If I had known that I would have purchased something else. It works ok.\n\n### Rating:\n4 <|end_of_text|>'}

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 8,
    packing = False, # Setting this to True can make training up to 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 32,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # max_steps = 60,
        num_train_epochs = 4, # For longer training runs!
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Set to "wandb" (or another tracker) to log the run
    ),
)

Tokenizing to ["text"] (num_proc=8):   0%|          | 0/10000 [00:00<?, ? examples/s]

In [6]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 10,000 | Num Epochs = 4 | Total steps = 312
O^O/ \_/ \    Batch size per device = 32 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (32 x 4 x 1) = 128
 "-____-"     Trainable parameters = 20,971,520/4,561,571,840 (0.46% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
10,2.2127
20,1.4808
30,1.4671
40,1.4333
50,1.3898
60,1.4175
70,1.3851
80,1.3876
90,1.3615
100,1.3495
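
The figures in the training banner can be checked by hand from the `TrainingArguments`. The sketch below reproduces the effective batch size and the total step count. The rounding rule (an incomplete accumulation window at the end of an epoch is not counted as an optimizer step) is an assumption about the trainer's internals, chosen because it reproduces the reported 312 steps.

```python
import math

num_examples = 10_000
per_device_batch = 32
grad_accum = 4
num_gpus = 1
num_epochs = 4

# Examples consumed per optimizer step.
effective_batch = per_device_batch * grad_accum * num_gpus

# Mini-batches the dataloader yields per epoch (the last one is partial).
batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))

# Assumption: optimizer steps per epoch are floored, as described in the lead-in.
updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
total_steps = math.ceil(num_epochs * updates_per_epoch)

print(f"effective batch size: {effective_batch}")
print(f"batches per epoch:    {batches_per_epoch}")
print(f"updates per epoch:    {updates_per_epoch}")
print(f"total steps:          {total_steps}")
```

With `logging_steps = 10`, a run of this length logs the loss about 31 times; the table above shows only the first 10 entries.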


In [2]:
import torch
import json
from unsloth import FastLanguageModel
from transformers import AutoTokenizer

# --- Step 1: Load your fine-tuned model and tokenizer ---
model_name_or_path = "/content/outputs/checkpoint-312"  # Adjust if necessary
max_seq_length = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name_or_path,
    max_seq_length=max_seq_length,
    dtype=None,           # Auto-detect dtype
    load_in_4bit=True,     # Ensure consistency with your training config
)

# Set the model to evaluation mode.
model.eval()

# --- Step 2: Define a function for inference ---
def generate_rating(review: str, max_gen_length: int = 50) -> str:
    # Build a prompt with the review as the input; note the rating is left empty.
    prompt = f"""You are an expert classifier tasked with determining the rating of an Amazon product review. For each review provided, assign a numerical rating from 1 to 5, where 1 is a very negative review and 5 is a very positive review. Don't give anything other than the rating.

### Input:
{review}

### Rating:
"""
    # Tokenize the prompt and move inputs to the model's device.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    prompt_length = input_ids.shape[1]

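    # Generate a continuation. Sampling is enabled, so repeated runs may return different ratings.
    outputs = model.generate(
        input_ids,
        max_new_tokens=max_gen_length,  # Cap on newly generated tokens (the prompt is excluded).
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,  # Token used for padding.
        eos_token_id=tokenizer.eos_token_id,  # Generation stops once this token is produced.
    )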
    # Slice the output to only get tokens generated after the prompt.
    generated_tokens = outputs[0][prompt_length:]
    generated_response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
    return generated_response.strip()


# --- Step 3: Read reviews from test.jsonl, perform inference, and compute accuracy ---
test_file_path = "/content/test.jsonl"
results = []
num_correct = 0
total = 0
total_error = 0
with open(test_file_path, "r") as f:
    for i, line in enumerate(f):
        if i >= 1000:
            break
        record = json.loads(line)
        review = record.get("Review")
        actual_rating = record.get("Rating")
        if review is not None and actual_rating is not None:
            rating_prediction = generate_rating(review)
            # print(rating_prediction)
            # Try to parse the generated text as an integer rating.
            try:
                predicted_rating = int(rating_prediction.strip())
            except ValueError:
                # The model produced something other than a bare integer.
                predicted_rating = None
                total_error += 1
            # Append to results
            results.append({
                "review": review,
                "actual_rating": actual_rating,
                "predicted_rating": predicted_rating,
                "raw_generated": rating_prediction
            })
            print(f"Record {i}:")
            print("Review:", review)
            print("Actual Rating:", actual_rating)
            print("Generated Rating (raw):", rating_prediction)
            print("Predicted Rating (parsed):", predicted_rating)
            print("-" * 50)
            if predicted_rating is not None and predicted_rating == actual_rating:
                num_correct += 1
            total += 1
print(f"Unparseable predictions: {total_error}")
accuracy = num_correct / total if total > 0 else 0
print(f"Accuracy over {total} records: {accuracy * 100:.2f}%")
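
The loop above treats a prediction as unparseable whenever `int()` fails, which happens if the model emits anything besides a bare integer within its 50-token budget. A more lenient parser, together with a couple of extra metrics, might look like the sketch below. `parse_rating` and `summarize` are hypothetical helpers, not part of this notebook, and the demo records are invented purely to exercise the code. Mean absolute error is included because ratings are ordinal: predicting 4 for a 5-star review is a smaller miss than predicting 1.

```python
import re
from typing import Optional


def parse_rating(raw: str) -> Optional[int]:
    """Return the first standalone digit from 1 to 5 in the text, or None if absent."""
    match = re.search(r"(?<!\d)[1-5](?!\d)", raw)
    return int(match.group()) if match else None


def summarize(results):
    """Accuracy over all records, parse failures, and MAE over parsed records."""
    total = len(results)
    parsed = [r for r in results if r["predicted_rating"] is not None]
    correct = sum(r["predicted_rating"] == r["actual_rating"] for r in parsed)
    abs_errors = [abs(r["predicted_rating"] - r["actual_rating"]) for r in parsed]
    return {
        "total": total,
        "parse_failures": total - len(parsed),
        # Unparseable predictions count as wrong, as in the loop above.
        "accuracy": correct / total if total else 0.0,
        "mae": sum(abs_errors) / len(abs_errors) if abs_errors else None,
    }


# Invented records, shaped like the entries appended to `results` above.
demo = [
    {"actual_rating": 5, "raw_generated": "5"},
    {"actual_rating": 4, "raw_generated": "3 stars"},
    {"actual_rating": 1, "raw_generated": "n/a"},
]
for record in demo:
    record["predicted_rating"] = parse_rating(record["raw_generated"])

summary = summarize(demo)
print(summary)
```

The look-around assertions in the pattern keep multi-digit numbers such as "10" from being read as a rating of 1. Reporting the parse-failure count alongside accuracy makes it clear how much of the error comes from formatting rather than from wrong ratings.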


Streaming output truncated to the last 5000 lines.
Record 167:
Review: MY J3 2018 LOVES THIS CASE. THE FOLDING COVER PROTECTS THE SCREEN, AND IT'S STILL SMALL ENOUGH TO BE CARRIED IN MY SHIRT POCKET IF NEEDED. A GREAT PRODUCT AND VALUE.
Actual Rating: 5
Generated Rating (raw): 5
Predicted Rating (parsed): 5
--------------------------------------------------
Record 168:
Review: This is the first time my screen protector saved my screen. I dropped my phone on concrete and the protector cracked diagonally in two spots. Buy my phone screen?  Not a blemish. After taking of the Ailun protector the phones screen was like new. Thanks. I am placing another order to have back ups.
Actual Rating: 5
Generated Rating (raw): 5
Predicted Rating (parsed): 5
--------------------------------------------------
Record 169:
Review: The water test was successful.  I put my iPhone 5s in it and it just does not seem a tight seal.  I can still squeeze it and get give from the case.  I am afraid t