<h1 align="center">CarDekho Q&A Fine-Tuning using TinyLLaMA</h1>

<h2> Objective</h2>
<p>
This project focuses on <b>fine-tuning the TinyLLaMA language model</b> using a domain-specific dataset from the CarDekho platform.
The goal is to train a lightweight, efficient model that can answer <b>car-related questions</b> such as pricing, features, specifications, and other automobile queries.
</p>

<h2> Project Workflow</h2>

<ul>
  <li><b>Data Collection:</b> Use CarDekho dataset containing car specs, prices, and features.</li>
  <li><b>Preprocessing:</b> Clean and structure the data into instruction-response pairs.</li>
  <li><b>Model Fine-Tuning:</b> Fine-tune TinyLLaMA on the car Q&A dataset using Google Colab and 4-bit quantization (BitsAndBytes).</li>
  <li><b>Evaluation:</b> Test the model's ability to answer car-related queries accurately.</li>
  <li><b>Deployment (optional):</b> Save and share the fine-tuned model for future use.</li>
</ul>

<h2>Technologies & Tools</h2>

<ul>
  <li>Google Colab</li>
  <li>Hugging Face Transformers</li>
  <li>BitsAndBytes (4-bit quantization)</li>
  <li>PEFT (Parameter-Efficient Fine-Tuning)</li>
  <li>Python</li>
</ul>

<h2> Dataset</h2>

<p>
The dataset includes car specifications, pricing, and features sourced from CarDekho.
The data is converted into an <b>Instruction-Response JSONL format</b> for fine-tuning.
</p>
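
<p>
As a minimal sketch of what an Instruction-Response JSONL file looks like, the snippet below writes two records and reads them back. The records and the file name <code>car_qa_sample.jsonl</code> are hypothetical; the <code>instruction</code>/<code>response</code> keys mirror the ones used later in this notebook.
</p>

```python
import json
import os
import tempfile

# Hypothetical records; the field names mirror the instruction/response
# keys used later in this notebook.
records = [
    {"instruction": "What is the market price of the 2019 Hyundai Santro?",
     "response": "The 2019 Hyundai Santro has listed price: 300000.0."},
    {"instruction": "What is the top speed of the 2019 Hyundai Santro?",
     "response": "Information not available"},
]

# JSONL means one JSON object per line, so each record is dumped separately.
path = os.path.join(tempfile.mkdtemp(), "car_qa_sample.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the file back line by line to confirm the round trip.
with open(path, "r", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), "records; keys:", sorted(loaded[0]))
```

<p>
Keeping one object per line means a malformed record can be skipped without discarding the rest of the file.
</p>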


In [1]:
import pandas as pd
import os
import torch

<h3>Device Setup</h3>

<pre>
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
</pre>

<p><b>Output Example:</b></p>

<ul>
  <li>If GPU is available: <code>Using device: cuda</code></li>
  <li>If GPU is not available: <code>Using device: cpu</code></li>
</ul>


In [2]:
device ="cuda" if torch.cuda.is_available() else "cpu"
print(device)

cuda


In [3]:
# ------ Load your preprocessed dataset here ------

df = pd.read_csv("/content/preprocessing_dataset.csv")

<h2>Data Preparation & Question Generation</h2>

<p style="font-size: 16px;">
In this step, we generated <b>instruction-response pairs</b> from the CarDekho dataset to fine-tune TinyLLaMA for <b>car-related Q&A tasks</b>.
</p>

<ul style="font-size: 16px;">
  <li><b>Random Question Templates:</b> Different question formats were created using car details such as model, year, price, fuel type, and more.</li>
  <li><b>Dynamic Data Filling:</b> Each car's information was inserted into these templates to generate personalized questions.</li>
  <li><b>Response Generation:</b> Detailed answers were generated using car specifications like listed price, kilometers driven, fuel type, transmission type, and ownership details.</li>
  <li><b>Fine-Tuning Format:</b> The data was formatted in <b>Instruction-Response style</b> suitable for TinyLLaMA fine-tuning:</li>
</ul>

<pre style="font-size: 15px;">
### Instruction:
What is the market price of the 2019 Hyundai Santro?

### Response:
The 2019 Hyundai Santro is listed at 300000.0. It has driven 25000 km, runs on petrol, has a manual transmission, and is a first owner vehicle.
</pre>

<p style="font-size: 16px;">
This structured Q&A dataset allows the model to learn <b>domain-specific knowledge about cars</b> and answer user queries more accurately.
</p>
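
<p>
The template-filling idea described above can be sketched as follows. This is a simplified variant, not the exact cell used in this project: each template is paired with one source column, the answer is just that column's value, and the <code>row</code> dictionary is a hypothetical stand-in for one line of the preprocessed CSV. The helper name <code>make_pair</code> is also hypothetical.
</p>

```python
import random

# (question template, source column) pairs; the column names follow the ones
# referenced in this notebook and are assumptions about the CSV schema.
QUESTION_TEMPLATES = [
    ("What is the market price of the {myear} {oem} {model}?", "listed_price"),
    ("How many kilometers has the {myear} {oem} {model} been driven?", "km"),
]

def make_pair(row, rng):
    """Build one instruction-response sample from a single car record."""
    template, column = rng.choice(QUESTION_TEMPLATES)
    question = template.format(myear=row["myear"], oem=row["oem"], model=row["model"])
    answer = str(row.get(column, "Information not available"))
    return f"### Instruction:\n{question}\n\n### Response:\n{answer}\n\n"

# Hypothetical row standing in for one line of the preprocessed CSV.
row = {"myear": 2019, "oem": "Hyundai", "model": "Santro",
       "listed_price": 300000.0, "km": 25000}
sample = make_pair(row, random.Random(0))
print(sample)
```

<p>
Passing an explicit <code>random.Random</code> instance keeps the generated dataset reproducible between runs.
</p>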


In [5]:
# import random
#
# # Each entry pairs a question template with the column that answers it.
# questions_templates = [
#     ("What is the market price of the {myear} {oem} {model}?", "listed_price"),
#     # ("How many kilometers has the {myear} {oem} {model} been driven?", "km"),
#     ("What is the wheel base (in mm) of the {myear} {oem} {model}?", "Wheel Base"),
#     ("What is the top speed of the {myear} {oem} {model}?", "Top Speed"),
#     ("At what RPM does the {myear} {oem} {model} deliver maximum torque?", "Max Torque At"),
# ]
# formatted_samples = []

# for _, row in df.iterrows():
#     brand = str(row.get("oem", "Unknown"))
#     model = str(row.get("model", "Unknown"))
#     year = str(row.get("myear", "Unknown"))
#     price = str(row.get("listed_price", "Unknown"))
#     fuel = str(row.get("fuel", "Unknown"))
#     transmission = str(row.get("transmission", "Unknown"))
#     owner = str(row.get("owner_type", "Unknown"))
#     km_driven = str(row.get("km", "Unknown"))

#     # Pick a random template and insert this car's details into the question
#     template, _ = random.choice(questions_templates)
#     question = template.format(myear=year, oem=brand, model=model)

#     # Create a response/answer
#     answer = (
#         f"The {year} {brand} {model} is listed at {price}. "
#         f"It has driven {km_driven} km, runs on {fuel}, has a {transmission} transmission, "
#         f"and is a {owner} owner vehicle."
#     )

#     # Format for TinyLLaMA
#     formatted_text = f"### Instruction:\n{question}\n\n### Response:\n{answer}\n\n"
#     formatted_samples.append(formatted_text)


In [2]:
# import json

# jsonl_data = []

# for _, row in df.iterrows():
#     for template, feature in questions_templates:
#         instruction = template.format(
#             myear=row['myear'], oem=row['oem'], model=row['model']
#         )
#         answer = str(row.get(feature, "Information not available"))

#         jsonl_data.append({
#             "instruction": instruction,
#             "response": answer
#         })

In [None]:
# # save formatted samples to text file
# with open("car_data_instructions.txt", "w", encoding="utf-8") as f:
#     f.writelines(formatted_samples)


In [None]:
# from datasets import Dataset

# # Load lines from your text file
# with open("/content/car_data_instructions.txt", "r", encoding="utf-8") as f:
#     lines = f.readlines()

# # Create Hugging Face dataset
# dataset = Dataset.from_dict({"text": lines})
# dataset = dataset[:1000]

In [None]:
#  dataset

Main code

In [None]:
import json

input_file = "/content/car_data_instructions.txt"  # your input .txt file
output_file = "/content/car_data_instructions.txt.jsonal"

with open(input_file, "r", encoding="utf-8") as infile, open(output_file, "w", encoding="utf-8") as outfile:
    lines = infile.read().split("### Instruction:")
    for entry in lines:
        if "### Response:" in entry:
            try:
                instruction_part, response_part = entry.split("### Response:")
                instruction = instruction_part.strip().replace("\n", " ")
                response = response_part.strip().replace("\n", " ")

                json_obj = {
                    "instruction": instruction,
                    "output": response
                }

                json.dump(json_obj, outfile)
                outfile.write("\n")

            except ValueError:
                # Skip malformed entries
                continue

print(f"Converted and saved to {output_file}")


Converted and saved to /content/car_data_instructions.txt.jsonal


<h2> Model Loading (Step-1)</h2>

<p style="font-size: 16px;">
In this step, we loaded the <b>TinyLLaMA pre-trained model</b> and its tokenizer from the Hugging Face Hub.
TinyLLaMA is a lightweight LLM suitable for fine-tuning and deployment in resource-constrained environments.
</p>

<ul style="font-size: 16px;">
  <li><b>Model Used:</b> <code>TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T</code></li>
  <li><b>Tokenizer:</b> The tokenizer converts text into model-readable tokens and handles padding, special tokens, etc.</li>
  <li><b>Model:</b> The pre-trained <b>causal language model</b> was loaded to prepare it for further fine-tuning.</li>
  <li><b>Device:</b> You can optionally move the model to GPU using <code>.to(device)</code> for faster training (in Colab, CUDA is usually available).</li>
</ul>



<p style="font-size: 16px;">
This step prepares the model and tokenizer for the fine-tuning pipeline.
</p>
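
<p>
The load cell in this notebook leaves 4-bit loading disabled. As a hedged sketch of how it could be enabled through BitsAndBytes, the helper below builds a <code>BitsAndBytesConfig</code> and passes it to <code>from_pretrained</code>. The function name <code>load_quantized_model</code> is hypothetical, the NF4 quantization type and float16 compute dtype are assumptions rather than settings taken from this project, and running it requires a CUDA GPU with the <code>bitsandbytes</code> and <code>accelerate</code> packages installed.
</p>

```python
MODEL_NAME = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"

def load_quantized_model(model_name=MODEL_NAME):
    """Load the tokenizer and a 4-bit quantized causal LM (needs a CUDA GPU)."""
    # Imported lazily so this cell can be defined on a machine without a GPU.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                     # store the weights in 4 bits
        bnb_4bit_quant_type="nf4",             # assumed quantization type
        bnb_4bit_compute_dtype=torch.float16,  # assumed compute dtype
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",  # let accelerate place the layers on the GPU
    )
    return tokenizer, model
```

<p>
On a GPU runtime it would be called as <code>tokenizer, model = load_quantized_model()</code>. A quantized model should not be moved with <code>.to(device)</code> afterwards, since <code>device_map</code> already handles placement.
</p>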


In [6]:
# step-1
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer

model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# 4-bit loading (load_in_4bit=True) is left disabled here; the Trainer moves the model to the GPU.
model = AutoModelForCausalLM.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
import json
from datasets import Dataset

# Load valid lines from JSONL and skip malformed ones
json_list = []
with open("/content/car_qa_clean.jsonl", "r", encoding="utf-8") as json_file:
    for line in json_file:
        try:
            json_list.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # Skip bad lines

# Limit to first 1000 valid entries and create dataset
json_list = json_list[:1000]
dataset = Dataset.from_list(json_list)
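
<p>
The conversion cell earlier in this notebook writes the answer under an <code>output</code> key, while the tokenization step reads a <code>response</code> key. A small normalization pass, sketched below, would accept either name and drop incomplete records before the dataset is built. The helper <code>normalize_record</code> and the sample records are hypothetical.
</p>

```python
def normalize_record(record):
    """Return a dict with 'instruction' and 'response' keys, or None if incomplete."""
    instruction = record.get("instruction")
    # Accept either key name for the answer text.
    response = record.get("response", record.get("output"))
    if not instruction or response is None:
        return None
    return {"instruction": str(instruction).strip(),
            "response": str(response).strip()}

# Hypothetical raw records: one uses the "output" key, one has no question.
raw = [
    {"instruction": "What is the top speed of the 2016 Maruti Wagon R?",
     "output": "Information not available"},
    {"instruction": "", "response": "orphan answer"},
]
clean = [r for r in map(normalize_record, raw) if r is not None]
print(clean)
```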


In [None]:
dataset['instruction'][:5]

['What is the market price of the 2016 Maruti Wagon R?',
 'How many kilometers has the 2016 Maruti Wagon R been driven?',
 'What is the wheel base (in mm) of the 2016 Maruti Wagon R?',
 'What is the top speed of the 2016 Maruti Wagon R?',
 'At what RPM does the 2016 Maruti Wagon R deliver maximum torque?']


<h3 style="font-size: 20px;">📌 Why Do We Set <code>pad_token = eos_token</code>?</h3>

<table style="font-size: 16px;">
  <tr>
    <th>Why?</th>
    <th>Explanation</th>
  </tr>
  <tr>
    <td>No <code>pad_token</code> in TinyLLaMA</td>
    <td>Causal LLMs don’t use padding by default.</td>
  </tr>
  <tr>
    <td>Fixes tokenizer issues</td>
    <td>Prevents errors during batching and training.</td>
  </tr>
  <tr>
    <td>Standard trick</td>
    <td>Setting <code>pad_token = eos_token</code> is a common workaround when fine-tuning causal models; the attention mask still marks which positions are padding.</td>
  </tr>
</table>

In [None]:
tokenizer.pad_token = tokenizer.eos_token

<h2> Data Tokenization</h2>

<p style="font-size: 16px;">
In this step, we prepared the dataset for fine-tuning by converting <b>instruction-response pairs</b> into model-readable token sequences.
</p>

<h3 style="font-size: 18px;">🔧 Functions Used:</h3>

<ul style="font-size: 16px;">
  <li><b><code>format_instruction()</code>:</b> Converts each sample into the <b>instruction-response prompt format</b> used throughout this project.</li>
</ul>

<pre style="font-size: 15px;">
### Instruction:
What is the market price of the 2019 Hyundai Santro?

### Response:
The 2019 Hyundai Santro has listed price: 300000.0.
</pre>

<ul style="font-size: 16px;">
  <li><b><code>tokenize_function()</code>:</b> Tokenizes each formatted text into input IDs and attention masks.</li>
  <li><b>Padding & Truncation:</b> Sequences are padded or truncated to a <b>max length of 250 tokens</b> so that every sequence in a batch has the same length.</li>
  <li><b>Labels:</b> For causal language modeling, <b>labels are set as a copy of input IDs</b> (this enables the model to predict the next token).</li>
</ul>

<h3 style="font-size: 18px;">🚀 Tokenization Workflow:</h3>

<ol style="font-size: 16px;">
  <li>Format the raw data into instruction-response pairs.</li>
  <li>Tokenize the pairs using the tokenizer.</li>
  <li>Set <code>labels = input_ids</code> for causal language modeling.</li>
  <li>Use <code>dataset.map()</code> to apply this function over the entire dataset efficiently.</li>
</ol>

<p style="font-size: 16px;">
This step ensures the data is properly tokenized and ready to be fed into the TinyLLaMA model for fine-tuning.
</p>
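
<p>
One refinement worth considering: when <code>labels</code> is a plain copy of <code>input_ids</code>, the padded positions also contribute to the loss. Because the padding token is the same as the EOS token here, padding cannot be identified by token ID, but the attention mask can be used instead. The sketch below replaces padded label positions with <code>-100</code>, the index that the cross-entropy loss ignores. The helper name <code>mask_padding_labels</code> and the token IDs are hypothetical, and this step is not part of the tokenization cell in this notebook.
</p>

```python
def mask_padding_labels(input_ids, attention_mask, ignore_index=-100):
    """Copy input_ids to labels, replacing padded positions with ignore_index."""
    return [
        token if keep == 1 else ignore_index
        for token, keep in zip(input_ids, attention_mask)
    ]

# Hypothetical token IDs: 2 plays the role of both EOS and padding here.
input_ids = [1, 529, 3186, 2, 2, 2]
attention_mask = [1, 1, 1, 1, 0, 0]
labels = mask_padding_labels(input_ids, attention_mask)
print(labels)  # [1, 529, 3186, 2, -100, -100]
```

<p>
The real EOS token at position 3 stays in the labels because its attention mask value is 1, so the model can still learn where an answer ends.
</p>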


In [None]:
def format_instruction(sample):
    return f"### Instruction:\n{sample['instruction']}\n\n### Response:\n{sample['response']}"

def tokenize_function(batch):
    texts = [format_instruction({"instruction": i, "response": o}) for i, o in zip(batch["instruction"], batch["response"])]

    # No return_tensors="pt" here: dataset.map stores plain Python lists
    tokenized = tokenizer(
        texts,
        max_length=250,
        padding="max_length",
        truncation=True,
    )

    tokenized["labels"] = tokenized["input_ids"].copy()  # Use copy for safety
    return tokenized

tokenized_dataset = dataset.map(tokenize_function, batched=True)


Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [None]:
# tokenized_dataset


In [None]:
# tokenized_dataset['input_ids']

<h2> Fine-Tuning Configuration</h2>

<p style="font-size: 16px;">
In this step, we define the <b>training configuration</b> for fine-tuning TinyLLaMA using Hugging Face's <code>Trainer</code> API.
This setup controls how the model learns from the dataset.
</p>

<h3 style="font-size: 18px;"> TrainingArguments:</h3>

<ul style="font-size: 16px;">
  <li><b><code>output_dir</code>:</b> Directory where the fine-tuned model will be saved (<code>./tinyllama-finetuned</code>).</li>
  <li><b><code>per_device_train_batch_size</code>:</b> Batch size per device set to <b>1</b> due to memory constraints (useful for Colab).</li>
  <li><b><code>gradient_accumulation_steps</code>:</b> Accumulate gradients over <b>4 steps</b> before each optimizer update, giving an effective batch size of 4 while only one sample is held in memory at a time.</li>
  <li><b><code>num_train_epochs</code>:</b> Train for <b>3 complete passes</b> over the dataset.</li>
  <li><b><code>save_strategy</code>:</b> Save the model at the <b>end of each epoch</b>.</li>
  <li><b><code>logging_dir</code>:</b> Directory for saving logs (<code>./logs</code>).</li>
  <li><b><code>logging_steps</code>:</b> Log training progress every <b>10 steps</b>.</li>
  <li><b><code>report_to</code>:</b> Set to <b>"none"</b> to disable tracking (can be set to <b>"wandb"</b> if using Weights & Biases).</li>
</ul>



<p style="font-size: 16px;">
These arguments control the <b>fine-tuning pipeline</b> and help manage training behavior, saving, and logging.
</p>
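
<p>
As a quick sanity check on these settings, the arithmetic below estimates the number of optimizer steps. It assumes a single GPU and the 1000-sample dataset built earlier; the result agrees with the <code>global_step=750</code> reported by the training run in this notebook.
</p>

```python
import math

num_samples = 1000                 # rows kept when the dataset is built
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_train_epochs = 3
num_devices = 1                    # assumption: a single Colab GPU

# Samples consumed per optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
steps_per_epoch = math.ceil(num_samples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(effective_batch_size, steps_per_epoch, total_steps)  # 4 250 750
```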


In [None]:
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

# model = AutoModelForCausalLM.from_pretrained(model_name)

training_args = TrainingArguments(
    output_dir="./tinyllama-finetuned",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    save_strategy="epoch",
    logging_dir="./logs",
    logging_steps=10,
    report_to="none"  # or "wandb" if you're tracking
)



<h2 style="font-size: 24px;"> Why LoRA?</h2>

<table style="font-size: 20px; width: 100%; border: 2px solid #666; border-collapse: collapse;">
  <tr style="background-color: #f2f2f2;">
    <th style="padding: 15px; border: 2px solid #666;">Benefit</th>
    <th style="padding: 15px; border: 2px solid #666;">Explanation</th>
  </tr>
  <tr>
    <td style="padding: 15px; border: 2px solid #666;">Reduces GPU usage</td>
    <td style="padding: 15px; border: 2px solid #666;">Only updates lightweight LoRA layers, not the full model.</td>
  </tr>
  <tr>
    <td style="padding: 15px; border: 2px solid #666;">Faster training</td>
    <td style="padding: 15px; border: 2px solid #666;">Uses less memory and compute, making training quicker.</td>
  </tr>
  <tr>
    <td style="padding: 15px; border: 2px solid #666;">Maintains quality</td>
    <td style="padding: 15px; border: 2px solid #666;">Keeps the base model knowledge intact while adapting to new tasks.</td>
  </tr>
</table>
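
<p>
To make the "lightweight" claim concrete, the sketch below counts the parameters that a rank-8 LoRA adapter adds. It relies on two assumptions that are not stated in this notebook: the TinyLlama-1.1B dimensions listed in the code, and adapters being attached to the <code>q_proj</code> and <code>v_proj</code> attention projections of every layer. Under those assumptions the count matches the figure printed by <code>print_trainable_parameters()</code> in this notebook.
</p>

```python
# Assumed TinyLlama-1.1B dimensions (not stated in this notebook).
hidden_size = 2048
num_layers = 22
num_attention_heads = 32
num_key_value_heads = 4
head_dim = hidden_size // num_attention_heads   # 64
kv_dim = num_key_value_heads * head_dim         # 256

r = 8  # LoRA rank from LoraConfig

def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)

# Assumption: adapters sit on q_proj and v_proj in every layer.
per_layer = (
    lora_params(hidden_size, hidden_size, r)   # q_proj: 2048 -> 2048
    + lora_params(hidden_size, kv_dim, r)      # v_proj: 2048 -> 256
)
trainable = per_layer * num_layers
total = 1_101_174_784  # "all params" as printed by print_trainable_parameters()

print(f"{trainable:,} trainable ({100 * trainable / total:.4f}%)")
# 1,126,400 trainable (0.1023%)
```

<p>
Roughly one parameter in a thousand is updated, which is why optimizer state and gradients take so little memory compared with full fine-tuning.
</p>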



In [None]:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # Should show roughly 0.1% trainable params

trainable params: 1,126,400 || all params: 1,101,174,784 || trainable%: 0.1023


In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    # eval_dataset=tokenized_datase
)

trainer.train()

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Step,Training Loss
10,12.3702
20,9.8572
30,5.6025
40,2.6639
50,1.5171
60,0.7516
70,0.5136
80,0.4481
90,0.4245
100,0.3571


TrainOutput(global_step=750, training_loss=0.5226639484167099, metrics={'train_runtime': 1089.7902, 'train_samples_per_second': 2.753, 'train_steps_per_second': 0.688, 'total_flos': 4660374528000000.0, 'train_loss': 0.5226639484167099, 'epoch': 3.0})

In [None]:
model.save_pretrained("tinyllama-alpaca")
tokenizer.save_pretrained("tinyllama-alpaca")


input_text = "### Instruction: What is the market price of the 2019 Hyundai Santro? \n\n### Response:"
inputs = tokenizer(input_text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=150  )  # Adjust max_new_tokens as needed
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

### Instruction: What is the market price of the 2019 Hyundai Santro? 

### Response:
The 2019 Hyundai Santro has listed price: 300000.0.
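
<p>
For repeated queries it can help to build the prompt in exactly the format used during training and to strip the prompt from the decoded text. The helpers below are a hypothetical sketch; <code>build_prompt</code> and <code>extract_response</code> are not part of this notebook, and the <code>decoded</code> string stands in for the output of <code>tokenizer.decode(...)</code>.
</p>

```python
def build_prompt(question):
    """Format a question like the training samples, without the answer."""
    return f"### Instruction:\n{question}\n\n### Response:\n"

def extract_response(generated_text):
    """Return only the text that follows the last '### Response:' marker."""
    marker = "### Response:"
    _, _, tail = generated_text.rpartition(marker)
    # Stop at a following instruction block if the model keeps generating.
    return tail.split("### Instruction:")[0].strip()

prompt = build_prompt("What is the market price of the 2019 Hyundai Santro?")
# Stand-in for tokenizer.decode(...) output, copied from the run above.
decoded = prompt + "The 2019 Hyundai Santro has listed price: 300000.0.\n"
print(extract_response(decoded))
```

<p>
In the notebook itself, <code>prompt</code> would be passed to the tokenizer and <code>extract_response</code> applied to the decoded generation.
</p>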


<h2 style="font-size: 24px;">✅ Final Project Statement</h2>

<p style="font-size: 18px;">
This project focused on <b>fine-tuning TinyLLaMA</b> for domain-specific <b>Question Answering (Q&A)</b> and <b>text generation</b> tasks. The goal was to adapt a pre-trained language model to handle real-world queries related to the used car domain while keeping the model efficient and resource-friendly.
</p>

<p style="font-size: 18px;">
To accomplish this, we used <b>Google Colab</b> as the development environment, which removed the need for local setup and made the pipeline easy to share and rerun. <b>4-bit quantization</b> (via BitsAndBytes) is the intended way to reduce memory usage on limited GPU resources; in the run recorded here the <code>load_in_4bit</code> option is left disabled, so the model was loaded without quantization.
</p>

<p style="font-size: 18px;">
Additionally, <b>LoRA (Low-Rank Adaptation)</b> was used to train small low-rank adapter matrices attached to selected layers while the base model weights stay frozen. This strategy reduced GPU load, accelerated the training process, and preserved the general knowledge of TinyLLaMA while specializing it on our custom dataset.
</p>

<p style="font-size: 18px;">
The dataset was prepared in <b>Instruction-Response</b> format, enabling the model to learn how to respond to specific car-related queries such as price, top speed, and technical specifications. After training, the model produced the expected answer for the sample prompt shown above. Accuracy on unseen prompts was not measured, so a held-out evaluation set is still needed to confirm that the model generalizes.
</p>

<p style="font-size: 18px;">
<b>In conclusion:</b> This project provides a complete pipeline for lightweight fine-tuning of LLMs like TinyLLaMA using LoRA and quantization. It shows how to create specialized models for practical applications while keeping training cost-effective and manageable on free-tier hardware like Google Colab.
</p>
