<a href="https://colab.research.google.com/github/bithack07/TinyLLamaFineTune/blob/main/TinyLlamaTraining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
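# Install the required libraries (the leading "!" runs pip from a notebook cell).
# peft is imported below; accelerate is needed by Trainer and by device_map="auto".
!pip install transformers datasets torch peft accelerate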



In [2]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model
from datasets import Dataset
import json

In [3]:
# Load the TinyLlama model and tokenizer (half precision to reduce GPU memory use)
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [10]:
# Configure LoRA
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
# Optional: uncomment the next line to enable gradient checkpointing (less memory, slower steps)
#model.gradient_checkpointing_enable()
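
For a sense of scale, the number of trainable parameters this LoRA configuration adds can be estimated by hand. The sketch below assumes dimensions from TinyLlama-1.1B's published configuration (hidden size 2048, 22 layers, and a 256-wide value projection because of grouped-query attention). None of these figures are stated in this notebook, so treat them as assumptions and prefer `model.print_trainable_parameters()` for the authoritative count.

```python
# Assumed TinyLlama-1.1B dimensions (not taken from this notebook).
HIDDEN_SIZE = 2048
NUM_LAYERS = 22
KV_DIM = 256  # 4 key/value heads x 64 dims per head (grouped-query attention)


def lora_param_count(in_features, out_features, r):
    """A rank-r adapter adds matrix A (r x in) and matrix B (out x r)."""
    return r * in_features + out_features * r


r = 8  # same rank as LoraConfig(r=8) above
per_layer = (
    lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, r)  # q_proj
    + lora_param_count(HIDDEN_SIZE, KV_DIM, r)     # v_proj
)
total = per_layer * NUM_LAYERS

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Under these assumptions the adapters amount to roughly a tenth of a percent of the 1.1B base weights, which is why only the adapter needs to be saved after training.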



In [11]:
# Load and preprocess dataset
import json
from datasets import Dataset

with open('data.json', 'r') as f:
    data = json.load(f)

system_prompt = "You are an assistant that answers questions based on the given context from Oracle Utilities documentation."
texts = [
    f"[INST] <<SYS>> {system_prompt} <</SYS>> Context: {item['section']} Question: {item['instruction']} [/INST] {item['response']}"
    for item in data
]

dataset = Dataset.from_dict({'text': texts})
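
The list comprehension above fails with a bare `KeyError` if any record in `data.json` lacks one of its three fields. A small helper can apply the same template while reporting which fields are missing. This is a sketch only: `build_prompt`, `REQUIRED_KEYS`, and the example record are hypothetical names and data introduced for illustration.

```python
SYSTEM_PROMPT = (
    "You are an assistant that answers questions based on the given "
    "context from Oracle Utilities documentation."
)
REQUIRED_KEYS = ("section", "instruction", "response")


def build_prompt(item, system_prompt=SYSTEM_PROMPT):
    """Format one record with the same template as the list comprehension."""
    missing = [key for key in REQUIRED_KEYS if key not in item]
    if missing:
        raise KeyError(f"record is missing fields: {missing}")
    return (
        f"[INST] <<SYS>> {system_prompt} <</SYS>> "
        f"Context: {item['section']} Question: {item['instruction']} [/INST] "
        f"{item['response']}"
    )


# Hypothetical record, invented for illustration only.
example = {
    "section": "Batch scheduling",
    "instruction": "What does this section cover?",
    "response": "It covers batch scheduling.",
}
prompt = build_prompt(example)
print(prompt)
```

With such a helper, the dataset could be built as `texts = [build_prompt(item) for item in data]`.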

In [12]:
def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, max_length=512)

tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=['text'])

# Set up data collator
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Clear GPU memory
torch.cuda.empty_cache()

Map:   0%|          | 0/248 [00:00<?, ? examples/s]
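
With `mlm=False`, `DataCollatorForLanguageModeling` prepares causal-LM batches: it pads each batch to a common length and sets `labels` to a copy of `input_ids` in which every pad-token position is replaced by -100, so the loss skips it. The dependency-free sketch below mimics that behaviour on plain lists. It is not the library's code; the right-side padding and the token ids are assumptions chosen for illustration.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def collate_causal_lm(sequences, pad_token_id):
    """Pad on the right (an assumption) and derive labels from input_ids."""
    width = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        padding = width - len(seq)
        input_ids = seq + [pad_token_id] * padding
        batch["input_ids"].append(input_ids)
        batch["attention_mask"].append([1] * len(seq) + [0] * padding)
        # Every position holding the pad id is masked, including any real
        # token that happens to share that id.
        batch["labels"].append(
            [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]
        )
    return batch


# Hypothetical token ids; 2 plays the role of the pad token here.
batch = collate_causal_lm([[1, 15, 27, 9], [1, 42]], pad_token_id=2)
print(batch["labels"])
```

With `per_device_train_batch_size=1`, as used in this notebook, each batch holds a single sequence and no padding is added.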

In [13]:
# Configure training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    logging_dir='./logs',
    logging_steps=10,
    save_steps=500,
    save_total_limit=2,
    fp16=True,
)

In [14]:
# Initialize and run trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)
trainer.train()

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Step,Training Loss
10,3.9581
20,3.8021
30,3.5824
40,3.4397
50,3.2247
60,3.1173
70,2.9384
80,2.5807
90,2.3371
100,2.1136


TrainOutput(global_step=186, training_loss=2.5424998344913607, metrics={'train_runtime': 141.8373, 'train_samples_per_second': 5.245, 'train_steps_per_second': 1.311, 'total_flos': 418632123101184.0, 'train_loss': 2.5424998344913607, 'epoch': 3.0})
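
The reported `global_step=186` and the throughput figures can be cross-checked from the training arguments and the 248 examples shown in the `Map` progress bar. The arithmetic below is a sketch that assumes a single device and mirrors how the step count works out when the number of batches divides evenly by the accumulation steps.

```python
import math

num_examples = 248            # from the Map progress bar
per_device_batch_size = 1
gradient_accumulation_steps = 4
num_train_epochs = 3
train_runtime_s = 141.8373    # from TrainOutput

# One optimizer update happens every `gradient_accumulation_steps` batches.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps
batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_updates = updates_per_epoch * num_train_epochs

steps_per_second = round(total_updates / train_runtime_s, 3)
samples_per_second = round(num_examples * num_train_epochs / train_runtime_s, 3)

print("effective batch size:", effective_batch_size)
print("optimizer steps:", total_updates)
print("steps per second:", steps_per_second)
print("samples per second:", samples_per_second)
```

These values agree with `global_step`, `train_steps_per_second`, and `train_samples_per_second` in the `TrainOutput` above. Since `save_steps=500` exceeds the 186 total steps, no intermediate checkpoint is written during this run.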

In [15]:
# Save the fine-tuned model
model.save_pretrained('./fine_tuned_model')
tokenizer.save_pretrained('./fine_tuned_model')

('./fine_tuned_model/tokenizer_config.json',
 './fine_tuned_model/special_tokens_map.json',
 './fine_tuned_model/tokenizer.model',
 './fine_tuned_model/added_tokens.json',
 './fine_tuned_model/tokenizer.json')

In [None]:
from google.colab import drive
drive.mount('/content/drive')


# save_pretrained writes a directory of adapter files, so the target is a folder, not a single file
model.save_pretrained('/content/drive/My Drive/my_model')

Mounted at /content/drive


**Retrieval**

In [21]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model and tokenizer
base_model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
base_model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# Load PEFT adapter
model = PeftModel.from_pretrained(base_model, './fine_tuned_model')

In [17]:
def generate_response(instruction):
    # Format chat messages
    chat = [
        {"role": "system", "content": "You are a helpful assistant that only provides information from your training data about Oracle. If a question is outside your training or you're unsure, respond with 'I don't have information about that in my training data.'"},
        {"role": "user", "content": instruction}
    ]

    # Apply chat template
    input_text = tokenizer.apply_chat_template(chat, tokenize=False)

    # Tokenize input
    input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)

    # Generate response with conservative parameters
    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            max_length=256,          # Reduced to prevent rambling
            temperature=0.1,         # Very low for deterministic outputs
            top_p=0.1,              # Very selective sampling
            top_k=10,               # Limit to top 10 tokens
            do_sample=True,         # Keep True for temperature to work
            repetition_penalty=1.2,  # Prevent repetitive text
            no_repeat_ngram_size=3, # Prevent repeating phrases
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id
        )

    # Decode output
    output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

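    # Extract response. Note: apply_chat_template uses <|user|>/<|assistant|> markers,
    # so '[/INST]' never appears and this split returns the whole decoded text.
    response = output_text.split('[/INST]')[-1].strip()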

    return response, output_text
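
To see why `temperature=0.1` combined with `top_p=0.1` behaves almost like greedy decoding, the sketch below applies temperature scaling and nucleus (top-p) filtering to a toy distribution in plain Python. The logits are hypothetical, and the filtering is a simplified illustration of the idea rather than the library's implementation.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalise to probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus(probs, top_p):
    """Indices of the smallest top-ranked set whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in ranked:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    return kept


toy_logits = [2.0, 1.5, 0.5]  # hypothetical scores for three candidate tokens

conservative = nucleus(softmax_with_temperature(toy_logits, 0.1), top_p=0.1)
relaxed = nucleus(softmax_with_temperature(toy_logits, 0.7), top_p=0.9)
print("temperature=0.1, top_p=0.1 keeps token indices:", conservative)
print("temperature=0.7, top_p=0.9 keeps token indices:", relaxed)
```

In this toy case the conservative setting leaves a single candidate, so sampling has nothing to choose between, while the looser setting keeps two.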

In [20]:
# Example usage
instruction = "Do all sources use the same file format for the MR002 batch process?"
response, full_output = generate_response(instruction)
#print("Full Output:", full_output)
print("Response:", response)

NameError: name 'device' is not defined
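
The error arises because Python resolves a global name such as `device` when the function body runs, not when the function is defined, so the cell defining `generate_response` succeeded and only the call failed. The minimal sketch below reproduces the same pattern without any model; `describe_target` is a hypothetical stand-in, and a plain string replaces `torch.device(...)` to keep the example dependency-free.

```python
def describe_target():
    # `device` is looked up at call time, not at definition time.
    return f"tensors go to {device}"


try:
    describe_target()
    first_call = "ok"
except NameError as err:
    first_call = f"NameError: {err}"

device = "cpu"  # stand-in for torch.device(...); defined only after the first call
second_call = describe_target()

print(first_call)
print(second_call)
```

Defining `device` before the first call, as the next cell does, removes the error.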

In [22]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model and tokenizer
base_model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# Create a copy of the base model for comparison
pure_base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load PEFT adapter for fine-tuned model
fine_tuned_model = PeftModel.from_pretrained(base_model, './fine_tuned_model')

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Function to generate responses with either model
def generate_response(instruction, use_fine_tuned=True):
    # Select model
    model = fine_tuned_model if use_fine_tuned else pure_base_model

    # Format chat messages
    if use_fine_tuned:
        system_prompt = "Only provide information which is in Oracle context. If it is unrelated, say you don't know anything"
    else:
        system_prompt = "You are a helpful assistant."

    chat = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": instruction}
    ]

    # Apply chat template
    input_text = tokenizer.apply_chat_template(chat, tokenize=False)

    # Tokenize input
    input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)

    # Generate response
    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            max_length=512,
            temperature=0.7 if not use_fine_tuned else 0.1,
            top_p=0.9 if not use_fine_tuned else 0.1,
            do_sample=True,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id
        )

    # Decode output
    output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

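    # Extract response. Note: apply_chat_template uses <|user|>/<|assistant|> markers,
    # so '[/INST]' never appears and this split returns the whole decoded text,
    # which is why the printed responses include the system and user turns.
    response = output_text.split('[/INST]')[-1].strip()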

    return response, output_text
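
The outputs recorded in this notebook contain the full prompt because the decoded text uses `<|system|>`, `<|user|>`, and `<|assistant|>` markers rather than `[/INST]`. One possible way to keep only the model's answer is to split on the last assistant marker, sketched below. `extract_reply` is a hypothetical helper, and the transcript is a shortened copy of the printed output used purely as test input.

```python
ASSISTANT_MARKER = "<|assistant|>"


def extract_reply(decoded_text, marker=ASSISTANT_MARKER):
    """Return the text after the last assistant marker, or all of it if absent."""
    _, found, tail = decoded_text.rpartition(marker)
    return tail.strip() if found else decoded_text.strip()


# Shortened transcript in the shape of this notebook's printed output.
decoded = (
    "<|system|>\nYou are a helpful assistant. \n"
    "<|user|>\nHow often does the MR002 batch process run? \n"
    "<|assistant|>\nThe MR002 batch process is a continuous process."
)

legacy = decoded.split("[/INST]")[-1].strip()  # the marker is absent, so nothing is removed
reply = extract_reply(decoded)
print(reply)
```

An alternative that avoids string markers altogether is to decode only the newly generated tokens, for example `output_ids[0][input_ids.shape[-1]:]`, since `generate` returns the prompt tokens followed by the continuation.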



In [23]:
# Test both models
test_questions = [
    "How often does the MR002 batch process run?"
]

print("Comparing Base Model vs Fine-tuned Model:")
print("=" * 50)

for question in test_questions:
    print(f"Question: {question}")

    # Base model response
    base_response, _ = generate_response(question, use_fine_tuned=False)
    print(f"Base Model: {base_response}")

    print("-" * 50)
    # Fine-tuned model response
    ft_response, _ = generate_response(question, use_fine_tuned=True)
    print(f"Fine-tuned: {ft_response}")



The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Comparing Base Model vs Fine-tuned Model:
Question: How often does the MR002 batch process run?
Base Model: <|system|>
You are a helpful assistant. 
<|user|>
How often does the MR002 batch process run? 
<|assistant|>
The MR002 batch process is a continuous process and runs continuously throughout the day, 24 hours a day. The batch process is designed to produce high-quality, consistent batches of MR002. The batch process ensures that the MR002 is produced consistently, with minimal variation in particle size, shape, and density. The MR002 batch process is designed to run continuously and is not disrupted by human intervention.
--------------------------------------------------
Fine-tuned: <|system|>
Only provide information which is in Oracle context. If it is unrelated, say you don't know anything 
<|user|>
How often does the MR002 batch process run? 
<|assistant|>
The MR002 batch process is a batch process that runs continuously. It is not specified in the given text whether it runs 